Clustering Algorithms - Science topic

Explore the latest questions and answers in Clustering Algorithms, and find Clustering Algorithms experts.
Questions related to Clustering Algorithms
  • asked a question related to Clustering Algorithms
Question
11 answers
I have a dataset for k-means clustering. I applied the same algorithm to it using two different tools (Weka and RapidMiner) and got different clusters. Which one should I use? Your suggestions are welcome.
Relevant answer
Answer
Dear Emil
Running k-means clustering several times is a good first approach.
Taking the "best" result in the end is a fine strategy.
But consider widening the approach from just comparing "best" results on the sum of squared differences between the data and the cluster centers to some quality criterion such as:
Silhouette coefficient, Gap statistic, Calinski-Harabasz index, Dunn index, Davies-Bouldin index, and others that have been mentioned above,
and then decide to use one of those, and explicitly optimize all meta-parameters (like number of patterns, initial K, ...) w.r.t. this index.
For image compression I had fruitful results using the Calinski-Harabasz index.
regards
Nils
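The strategy described in this answer (several restarts, then picking the result by a quality index rather than by raw squared error) can be sketched as follows. This is a minimal pure-Python illustration, not the code of Weka or RapidMiner: the helper names (`kmeans`, `calinski_harabasz`, `best_clustering`), the toy data, and the choice of the Calinski-Harabasz index as the selection criterion are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd's algorithm from one random initialisation."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean(members)
    return labels, centers

def calinski_harabasz(points, labels, centers):
    """Between-cluster over within-cluster dispersion; higher is better."""
    n, k = len(points), len(centers)
    overall = mean(points)
    within = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    between = sum(labels.count(c) * sq_dist(centers[c], overall)
                  for c in range(k))
    return (between / (k - 1)) / (within / (n - k))

def best_clustering(points, k_values, n_restarts=10, seed=0):
    """Try every k with several restarts; keep the run with the best index."""
    rng = random.Random(seed)
    best = None
    for k in k_values:
        for _ in range(n_restarts):
            labels, centers = kmeans(points, k, rng)
            score = calinski_harabasz(points, labels, centers)
            if best is None or score > best[0]:
                best = (score, k, labels)
    return best

# Toy data: three tight, well-separated groups of five points each.
offsets = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
blobs = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
points = [(bx + dx, by + dy) for bx, by in blobs for dx, dy in offsets]
score, k, labels = best_clustering(points, k_values=range(2, 6), n_restarts=50)
print(k, round(score, 1))
```

Fixing the seed makes the comparison repeatable. Two tools that start from different random initialisations can legitimately return different partitions of the same data.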
  • asked a question related to Clustering Algorithms
Question
8 answers
Kindly give suggestions on how to handle an ordinal categorical dataset with a clustering algorithm.
Relevant answer
Lutsenko E.V., Subsystem of agglomerative cognitive clustering of classes of the "Eidos" system ("Eidos-cluster"). Patent No. 2012610135 RF. Application No. 2011617962 RF, 26.10.2011. Published 10.01.2012. Available at: http://lc.kubagro.ru/aidos/2012610135.jpg, 3.125 conventional printed sheets.
Lutsenko E.V. Method of cognitive clustering, or knowledge-based clustering (clustering in system-cognitive analysis and the intelligent system "Eidos") / E.V. Lutsenko, V.E. Korzhakov // Polythematic Online Scientific Journal of Kuban State Agrarian University (Scientific Journal of KubSAU) [Electronic resource]. Krasnodar: KubSAU, 2011. No. 07(071). pp. 528-576. Informregistr code: 0421100012\0253, IDA [article ID]: 0711107040. Available at: http://ej.kubagro.ru/2011/07/pdf/40.pdf, 3.062 conventional printed sheets.
  • asked a question related to Clustering Algorithms
Question
2 answers
Specifically, what is the difference between
A) POPDATA=1, LOCPRIOR=1, LOCDATA=0, LOCISPOP=1 and
B) POPDATA=0, LOCPRIOR=1, LOCDATA=1, LOCISPOP=0 ?
For the clustering algorithm and calculating Q, I think it is the same. In both cases, the Locprior model uses the sampling location information, only from formally different sources (POPDATA or LOCDATA column). The further difference is that the A settings generate Q values not only for individuals but also for pre-defined populations.
Am I right?
Thank you.
Relevant answer
Answer
Thank you. As mentioned in the question, I am using Structure (version 2.3).
  • asked a question related to Clustering Algorithms
Question
9 answers
I want to use this algorithm for grouping users based on similarities.
Relevant answer
Answer
Have you managed to implement it?
  • asked a question related to Clustering Algorithms
Question
4 answers
I have a longitudinal dataset of readings from 500 sensors at 25 time points.
I want to cluster the sensor readings at each time point, from T1 to T25,
but I also want an easy way to see which cluster each sensor falls in at each time point.
Assuming I have found 3 clusters at T1 and 2 at T2, is there a way to measure whether the sensors in T1 cluster 1 are significantly similar to the sensors in T2 clusters 1 and 2?
Another question: which machine learning clustering algorithms are good for my case? Which ones don't require hyperparameter tuning (such as defining the number of clusters)?
Lastly, is there any visualisation method to plot each sensor's cluster over time?
Relevant answer
Answer
Thank you again, dear Inès François.
I haven't used sequence plots before. Have you tried them?
I am asking because I want to know whether I can extract the colours used in the plot (which would be the target states in my case), or whether I have to re-cluster the images of these plots and use a get-colour function to identify the colours used.
Actually, I thought about that because I might end up with more than 5 time series clusters.
So, in this case, do sequence plots use the same colour for a given number across plots, or will each plot use a different colour for each number?
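For the part of the question about comparing clusters found at T1 with clusters found at T2, one simple descriptive measure is the Jaccard overlap of their member sets. The sketch below assumes each time point's result is stored as a dict from sensor id to cluster label; the function name `cluster_overlap` and the sample ids are hypothetical. It gives an overlap score, not a significance test, so a permutation test or similar would still be needed to judge significance.

```python
def cluster_overlap(labels_t1, labels_t2):
    """Jaccard index between every T1 cluster and every T2 cluster.

    labels_t1 and labels_t2 map sensor id -> cluster label at that time point.
    """
    def members(labels):
        groups = {}
        for sensor, cluster in labels.items():
            groups.setdefault(cluster, set()).add(sensor)
        return groups

    g1, g2 = members(labels_t1), members(labels_t2)
    return {
        (c1, c2): len(s1 & s2) / len(s1 | s2)
        for c1, s1 in g1.items()
        for c2, s2 in g2.items()
    }

# Hypothetical assignments for four sensors at two time points.
t1 = {"s1": 1, "s2": 1, "s3": 2, "s4": 3}
t2 = {"s1": 1, "s2": 1, "s3": 1, "s4": 2}
for pair, score in sorted(cluster_overlap(t1, t2).items()):
    print(pair, round(score, 2))
```

A value of 1.0 means the two clusters contain exactly the same sensors and 0.0 means they share none. The resulting table can also drive a per-sensor, per-time-point plot of cluster membership.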
  • asked a question related to Clustering Algorithms
Question
1 answer
I have a set of 2000 sequences. In that set there are two sequences
>Seq1
AAAAAAAAAAAAAAAAAAAAAAA
>Seq2
UUUUUUUUUUUUUUUUUUUUU
When clustering with Cd-hit-est with default options, these two are put in the same cluster with Seq1 as representative and Seq2 as.... -/100%
Is this correct?
Relevant answer
Answer
Angana Ray Yes, that output is plausible. Cd-hit-est uses a short-word counting filter to group sequences whose identity meets a user-specified threshold. The identity threshold is 90% by default, so sequences with 90% or higher identity are grouped together. As far as I can tell, in your example Seq2 (poly-U, which the program reads as poly-T) is the reverse complement of Seq1 (poly-A), and cd-hit-est compares both strands by default. The two sequences therefore land in one cluster with Seq1 as the representative, and the "-/100%" reported for Seq2 would denote a 100% identity match on the minus (reverse-complement) strand.
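One way to sanity-check a "-/100%" entry is to compute the reverse complement by hand. The snippet below is only an illustration of that check. It assumes that U is read as T and that the tool compares both strands by default, which is my understanding of cd-hit-est but is worth confirming against its documentation. The helper name `reverse_complement` is made up for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a nucleotide string, reading U as T."""
    return seq.upper().replace("U", "T").translate(COMPLEMENT)[::-1]

seq1 = "A" * 23  # Seq1 from the question
seq2 = "U" * 21  # Seq2 from the question

# True if the minus strand of Seq2 aligns perfectly inside Seq1.
print(reverse_complement(seq2) in seq1)
```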
  • asked a question related to Clustering Algorithms
Question
5 answers
We are applying the k-means clustering algorithm to unlabeled data. Our aim is to end up with a result that separates the data into two possible outcomes. Is it then necessary to carry out k-NN classification after the clustering?
Relevant answer
I think clustering is one way of classifying. Therefore, after clustering, some additional classification is not required.
  • asked a question related to Clustering Algorithms
Question
5 answers
I currently use .csv files to work with pandas dataframes and perform UMAP analyses and I would like to use Scanpy moving forward. Can anyone help me with converting .csv files into Anndata files for Scanpy?
Relevant answer
Answer
Myles Joshua Toledo Tan Converting a .csv file to an AnnData file for use in Scanpy is a simple procedure. Here's an example of how it may be done:
1. You must first install the anndata package, which can be done by typing pip install anndata into your command line.
2. Following that, import the relevant libraries, such as pandas and anndata.
import pandas as pd
import anndata
3. Then, using the pd.read_csv() method, read your .csv file into a pandas DataFrame. If the first column holds row names rather than data, passing index_col=0 to pd.read_csv() keeps it out of the data matrix.
data = pd.read_csv("your_file.csv")
4. After that, you can use the anndata.AnnData() constructor to convert the DataFrame to an AnnData object.
adata = anndata.AnnData(data)
5. Finally, you may use the scanpy library's different methods to conduct any extra processing or analysis on your AnnData object.
It's worth noting that when you convert a DataFrame to an AnnData object, the rows are treated as observations and the columns as variables. If you have the inverse, use .T to transpose the DataFrame.
You can alternatively use scanpy.read_csv() to import the csv file directly into an AnnData object to get the same result.
import scanpy as sc
adata = sc.read_csv("your_file.csv")
Please let me know if there is anything else I can do for you.
  • asked a question related to Clustering Algorithms
Question
5 answers
I am planning to conduct a cluster analysis of EEG PSD values and respective synchronicity estimates (e.g. alpha/theta) using deep learning methods. I am still in the literature review stage and I would like to know the latest deep learning algorithms used for EEG clustering.
Relevant answer
Answer
You can check this publication, where I use an MLP (deep learning). Adagrad gave good results.
@article{islam2022explainable, title={Explainable machine learning methods for classification of brain states during visual perception}, author={Islam, Robiul and Andreev, Andrey V and Shusharina, Natalia N and Hramov, Alexander E}, journal={Mathematics}, volume={10}, number={15}, pages={2819}, year={2022}, publisher={MDPI} }
  • asked a question related to Clustering Algorithms
Question
7 answers
I have a dataset like this:
users T1 T2 … Tn
1 [1,2,1.5] [1,3,3] … [2,2,6]
2 [1,5,1.5] [1,3,4] … [2,8,6]
.
n [1,5,7.5] [5,3,4] … [2,9,6]
Each list is a distinct incident, and the incidents change over time.
My aim is to find the distinct incidents that might happen to users over time.
I thought of feeding the full dataset to clustering algorithms, but I need advice on the best algorithms for such a 2D dataset, or on the best approach to solving this problem.
Relevant answer
Answer
K-MDTSC: K-Multi-Dimensional Time-Series Clustering Algorithm.
  • asked a question related to Clustering Algorithms
Question
1 answer
I want to learn and include the temporal relationship alongside my variables at each time series instance.
I learnt that earlier work used models to represent all variables by a single variable, or to reduce the dimensionality of the variables.
I also read that some authors created a time-indexing variable holding the first differences of time (e.g. T2-T1 for the first instance). I want to know more techniques for including the temporal relationship as a variable, or by any other means, before feeding my dataset to clustering algorithms. Do you know any other techniques to represent this as a feature, or to transform existing features so that they include temporal/spatial patterns, or what are called inter/intra patterns?
Relevant answer
Answer
If you are asking for Brain rs-fMRI,
1. Neuronal activation patterns between brain regions
2. dynamic Functional Connectivity(dFC) network to record the temporal changes
  • asked a question related to Clustering Algorithms
Question
8 answers
k-means clustering code in C++
Relevant answer
Answer
  • asked a question related to Clustering Algorithms
Question
3 answers
For example, suppose there are 50 data points which need to be clustered into seven parts, in a way so that each cluster contains at least five and at most 10 data points.
Relevant answer
Answer
The maximum possible number of clusters will be equal to the number of observations in the dataset.
  • asked a question related to Clustering Algorithms
Question
2 answers
I understand that one of the methods to initialize the k medoids is by finding the minimum sum of the distance to every data point. In terms of the SWAP stage, some resources stop the iteration after a predefined number of iterations. However, the original PAM algorithm stops iterating when the Total Deviation (SSE) is minimized.
So, could you please confirm my method of doing the SWAP stage?
I sort the data points from the lowest sum of the distance to the highest. If the K is set to 3 then the first three data points are the initial medoids.
{M3, M1, M5, x0, x2, x7, x9, x8, x6, x4}
In the SWAP stage, first I calculate the Total Deviation (TD) for the initial medoids and set it aside for comparison.
Then I calculate the TD when swapping M3 with x0, x2, x7, x9, x8, x6, x4 and stop this process when the previous TD (within the loop) is smaller than the current TD. And I set the outcome of this process aside for comparison.
I repeat the last process for M1 and M5 and set the outcomes aside for comparison.
Finally, I end up with four TD values: one from the initial medoids and three from checking each initial medoid against the non-medoid data points. I then pick the configuration with the smallest TD.
Is this the correct way to do the SWAP stage?
Relevant answer
Answer
PAM stands for “partition around medoids”. The algorithm is intended to find a sequence of objects called medoids that are centrally located in clusters.
Regards,
Shafagat
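For comparison with the procedure described in the question, here is a minimal sketch of the SWAP stage as it is usually described for PAM: evaluate every (medoid, non-medoid) pair, apply the single swap that lowers the total deviation (TD) the most, and repeat until no swap improves TD. The function names and the toy data are assumptions for illustration. Unlike the procedure in the question, this version does not stop scanning candidates at the first increase in TD and does not treat each medoid in isolation.

```python
def total_deviation(points, medoids, dist):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_swap(points, medoids, dist):
    """Repeat the best single (medoid, non-medoid) swap until none lowers TD."""
    medoids = list(medoids)
    best_td = total_deviation(points, medoids, dist)
    while True:
        best_swap = None
        for i in range(len(medoids)):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                td = total_deviation(points, trial, dist)
                if td < best_td:  # remember the best improving swap so far
                    best_td, best_swap = td, trial
        if best_swap is None:  # no swap improves TD: converged
            return medoids, best_td
        medoids = best_swap

# Toy 1-D example with a deliberately poor initialisation.
data = [1, 2, 3, 10, 11, 12]
final_medoids, td = pam_swap(data, [1, 2], dist=lambda a, b: abs(a - b))
print(sorted(final_medoids), td)
```

Each pass costs on the order of k·(n−k)·n distance evaluations, which is why some implementations cap the number of iterations.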
  • asked a question related to Clustering Algorithms
Question
4 answers
How can optimization algorithms be combined with clustering algorithms?
Relevant answer
Answer
First select representative features using clustering, and then use the optimization algorithm to fit and solve the classification problem with the selected features.
  • asked a question related to Clustering Algorithms
Question
2 answers
I do research on image clustering and I need a suitable database for it.
Relevant answer
Answer
Hi,
Depending on your algorithm and your goal, there may be lots of datasets to work with. Besides, you can take any image classification dataset (like MNIST), simply ignore its labels, and set your goal as putting images with the same label in the same group. Also, I find these two datasets a good starting point.
  • asked a question related to Clustering Algorithms
Question
2 answers
Hello,
I would like to cluster my dataset, which contains approximately 35,000 protein sequences. Be that as it may, I need to make some clusters in terms of their superfamilies. Thus, I would like to set the number of clusters; for example, we can specify the number of clusters in the K-Means cluster algorithm.
Can I define the number of clusters (i.e., 10, 50, 250) using the MMseqs2/Linclust tool? Thank you!
Relevant answer
I certainly avoid answering inquiries if I have vague notions regarding a particular topic. I always read a question before responding so that people do not think I am an imbecile.
  • asked a question related to Clustering Algorithms
Question
5 answers
Hi, I am working on UAV deployment in a wireless communication scenario to cache. Does anybody know how to simulate/deploy UAVs in Matlab using weighted or simple k-mean clustering? Below I also attached a snapshot.
Thank you
Relevant answer
Answer
Hi Imad,
Here are three links on the subject that should interest you. The first is a thesis which, despite dating from 2010, remains current.
Communication among UAVs Thesis in Computer Engineering Jun 2010
Chaoyou Dai Yifei Li Weiming Zhai
Here is a non-exhaustive list of tools that can be used: UTSim, FlyNetSim, UAV Toolbox (MathWorks), including Matlab…
For more details on this subject, I suggest you see the links and the attached file on the topic.
Article Unmanned Aerial Vehicle Propagation Datalink Tool Based on a...
Best regards
  • asked a question related to Clustering Algorithms
Question
2 answers
Hi,
I have several .fastq files from Nanopore sequencing technology. I have already done trimming and filtering. I was also able to do clustering using isONclust. Nevertheless, I have the following concern. How do I generate a single "otu table" with the clustering files' outputs separated by samples? In short, I run the following commands, but I do not know how to merge everything into one file.
isONclust --fastq sample1_filter.fastq --ont --medaka --outfolder sample1_clust/
isONclust --fastq sample2_filter.fastq --ont --medaka --outfolder sample2_clust/
isONclust --fastq sample3_filter.fastq --ont --medaka --outfolder sample3_clust/
Relevant answer
Answer
Thanks a lot for your answer. I already went through the paper and tutorial, but unfortunately, I did not find the information I need.
  • asked a question related to Clustering Algorithms
Question
2 answers
Hello,
I'm working on a clustering analysis and would be curious if anyone has ideas about how to deal with nested categorical variables.
Normally I would calculate a distance/dissimilarity matrix (Gower when some variables are categorical), and then feed this to a clustering algorithm of choice. Now what happens when some categorical variables are nested?
Fictitious example
If measuring characteristics of water samples like turbidity, temperature, dissolved gases, and presence/absence of 50 chemical compounds in the water.
* presence/absence of chemical compounds can be treated as 50 separate binary/categorical variables
* but say that these chemicals belong to 4 groups of compounds?
Thoughts
We could simply add an additional categorical variable "group" and for more complex nesting "subgroup", "subsubgroup"... OK, but as far as I understand, Gower distance is a bit like Manhattan distance in that it calculates a distance for each variable and then adds weights. But part of the information will be redundant, and even more so if there are more levels of nesting. I was wondering whether anyone has come up with something else to specifically deal with that. Maybe some form of weighting of the variables?
Looking forward to your inputs!
Mick
Relevant answer
Answer
Thank you for taking the time to reply Muhammad Ali .
I looked at the linked resources but do not see anything related to my question (*nested categorical* variables, not simple categorical variables). In case I missed it, could you please indicate the relevant section?
Kind regards,
Mick
  • asked a question related to Clustering Algorithms
Question
3 answers
I am trying to cluster a few categorical & continuous observations using an unsupervised cluster algorithm. I have used two-step clustering since I have categorical predictors! Could someone help me with algorithms to compute predictor importance in unsupervised learning?
Relevant answer
Answer
Muhammad Ali Thank you for your response. I couldn't find a link on unsupervised clustering, though!
  • asked a question related to Clustering Algorithms
Question
4 answers
I need to run a clustering algorithm in the RPL protocol using the COOJA simulator. Could you please provide me with clustering code or any information? How can I create the clusters and choose cluster heads? And how do sensor nodes communicate with the sink via cluster heads? Please give suggestions.
Relevant answer
Answer
Thank you Shafagat Mahmudova for your cooperation.
  • asked a question related to Clustering Algorithms
Question
3 answers
I am developing a dataset with scientific articles. The dataset is a matrix where the rows are the different articles and the columns are the keywords extracted from the articles themselves. The value of each cell is the relevance of each word for each article normalized in L2.
I want to use clustering techniques to find out groups of articles based on the keywords extracted but the sparsity of the dataset is huge as you can imagine. However, the matrix itself is not sparse, as the zeroes mean that a word has no relevance in an article.
Having this in mind, I used UMAP for dimensionality reduction to visualize the dataset but I am not sure when to use the clustering algorithm. Should I use it before I reduced the dataset or after I reduced it?
I have seen in other posts that it is better to first apply the dimensionality reduction and then use the clustering algorithm but I'm not sure if this affects the output of the clustering.
Relevant answer
Answer
Roberto Mancebo Dimension reduction is crucial in cluster analysis since it results in less data storage while retaining the same analytical outcomes as the original representation. To achieve an efficient processing time when clustering and to reduce the curse of dimensionality, a clustering algorithm requires data reduction.
Also, take a look at:
  • asked a question related to Clustering Algorithms
Question
9 answers
I work with extremely short texts which span over 1 to 7 words. I have tried simple embeddings based clustering using simple clustering algorithms but the results are not satisfactory. Types of embeddings I have used are fasttext, glove etc. So I was wondering if there is any special work for such specific short texts to cluster and learn their underlying structures?
Please feel free to provide anything ranging from a blog to a code to a proper scientific paper.
Relevant answer
Answer
Maybe one idea is to use character embedding instead of word embedding. Higher-level embeddings also could be useful such as morpheme level or unsupervised subword tokenization.
  • asked a question related to Clustering Algorithms
Question
4 answers
I have several raster files and I need to perform a comparative analysis between them. In this context, a cluster analysis could be a good way to explore the possible grouping between raster files. Is there an R or Matlab script for this goal?
Relevant answer
Answer
Thanks Aravinda C V , however, I insist that I am looking for a script that provides a way to perform a cluster analysis between "raster files"
  • asked a question related to Clustering Algorithms
Question
6 answers
For unsupervised text clustering, the key thing is the init embedding for text.
If we want to use https://github.com/facebookresearch/deepcluster for text, the problem for text is how to get the init embedding from deep model.
BERT can not get good init embedding.
If we do not use deep model, is there better way to get embedding better than glove wordvec?
Thank you very much.
Relevant answer
Answer
Dear Tong Guo
In the following paper,
a new embedding technique based on deep learning for text clustering has been proposed.
  • asked a question related to Clustering Algorithms
Question
4 answers
In my experience, pre-trained models are not suitable for unsupervised tasks. Especially in deep clustering when pre-trained models are used, they often have worse results than without pre-trained models.
What is the scientific reason for this?
Why are the learned representations in pre-trained models not suitable for clustering?
Relevant answer
Answer
Dear Amin,
This is a good question. The below link describes very well the solid definition of transfer learning and mentioned why and when we should use a pre-trained model
"A pre-trained model may not be 100% accurate in your application, but it saves huge efforts required to re-invent the wheel."
  • asked a question related to Clustering Algorithms
Question
5 answers
I'm a PGR student and I would like to implement a new clustering algorithm using NS3. I have basic programming skills, and I have read about object-oriented C++ and NS3. I tried to create my network and deploy the nodes and the sink. Does anyone have a good idea of how I can do this in NS3, or how I can get started? I'm confused and do not know what to do. Thank you for helping.
Relevant answer
Answer
Did you find anything that can help me, too?
  • asked a question related to Clustering Algorithms
Question
11 answers
Hello everyone,
Could you recommend papers, books or websites about unsupervised neural networks?
Thank you for your attention and valuable support.
Regards,
Cecilia-Irene Loeza-Mejía
  • asked a question related to Clustering Algorithms
Question
4 answers
Hello to everyone. I am trying to use a k-nearest-neighbour (KNN) distance analysis to fix minPts in the DBSCAN clustering algorithm. My dataset has only 4 variables and 935 observations. I have found that with k = 5 (no. of variables + 1) DBSCAN outputs 2 clusters: one of 911 observations and one of 8 observations. If I use a larger k, such as sqrt(no. of observations) as suggested in many papers, I get 909 observations in a single cluster and the rest are classified as noise points.
Both could be valid results, but their meaning is fundamentally different. How can I get rid of this arbitrary choice of minPts, and hence k?
Thanks!
Relevant answer
Answer
The boundary becomes smoother with increasing value of K. The training error rate and the validation error rate are the two quantities we need in order to assess different K values.
To get the optimal value of K, you can segregate the training and validation from the initial dataset. Now plot the validation error curve to get the optimal value of K. This value of K should be used for all predictions.
The optimal K value usually found is the square root of N, where N is the total number of samples. Use an error plot or accuracy plot to find the most favorable K value. KNN performs well with multi-label classes, but you must be aware of the outliers.
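For the DBSCAN setting in the question (as opposed to the KNN classifier), the k-nearest-neighbour analysis usually means a sorted k-distance curve. For a chosen k, compute each point's distance to its k-th nearest neighbour, sort the values, and look for a knee that suggests eps. The sketch below only computes that curve. The function name `k_distances` and the toy points are assumptions, and the choice of k (and so minPts) remains a judgment call that this heuristic does not remove.

```python
import math

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbour (self excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Toy data: three close points and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
print(k_distances(pts, k=2))
```

Comparing the curves for several k values (for example 5 and about 30 for 935 observations) shows how sensitive the implied eps is to that choice. The brute-force loop is quadratic in the number of points, which is fine at this data size.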
  • asked a question related to Clustering Algorithms
Question
3 answers
Any suggestions please
Relevant answer
Answer
There is no solution for your problem. Either change your clustering strategy or run the process on a supercomputer or cluster with terabytes of RAM.
  • asked a question related to Clustering Algorithms
Question
5 answers
I want to develop an ensemble approach where the final layer of a CNN model(Flatten layer in this case) will be followed by a K-Means Clustering algorithm where I want to cluster inputs into a number of categories same as required number of categories in a task. I want help regarding how to apply K-Means Clustering with a CNN.
Relevant answer
Answer
If you are doing a classification task, you could simply run both k-means and the CNN and compare their outputs; agreement between them makes you more confident in the result (even better if you use more than two methods).
  • asked a question related to Clustering Algorithms
Question
4 answers
Hello researchers, suppose we have an ensemble model that applies different association mining algorithms (such as Apriori and Tertius) and clustering algorithms (such as K-Means and Cobweb) to the same dataset, and I want to analyse which algorithm performs best. Is there a simple automated way in the Weka tool to compare the performance of these association mining and clustering algorithms in terms of accuracy, time, etc.?
  • asked a question related to Clustering Algorithms
Question
4 answers
Dear researchers,
I want to apply clustering tasks for city traffic time series for a research in Intelligent Transportation Systems.
Can you recommend a city traffic dataset that contains the speed, flow, or occupancy measurements for roads and also contains the type of traffic pattern for each road? This type can be provided as a class label for each road, or a class label for each day measurements on each road (a class label for the whole time series).
Any help is much appreciated, thanks in advance.
Relevant answer
Answer
Hi Muhammad Zahid Khattak , I have this kind of data.
Write me and we will discuss about the possibility to make some experiments together.
  • asked a question related to Clustering Algorithms
Question
5 answers
I want to evaluate the robustness of the clustering algorithm to noise ... How can I add noise to the data... Is there a well known method (such as salt and pepper in the image data)?
Relevant answer
Answer
Simply put, you could set some values to missing (empty).
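Besides blanking values, two other common ways to perturb tabular data are to jitter every value with Gaussian noise and to add uniformly scattered outlier points, which is loosely analogous to salt-and-pepper noise in images. The sketch below assumes the data is a list of numeric rows. Both function names and the parameters `sigma` and `fraction` are hypothetical choices for the example.

```python
import random

def add_gaussian_noise(rows, sigma, seed=0):
    """Return a copy of the data with N(0, sigma^2) noise added to every value."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in rows]

def add_uniform_outliers(rows, fraction, seed=0):
    """Append random points drawn uniformly from the bounding box of the data."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    n_new = int(round(fraction * len(rows)))
    outliers = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
                for _ in range(n_new)]
    return rows + outliers

data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
print(add_gaussian_noise(data, sigma=0.1))
print(add_uniform_outliers(data, fraction=0.5))
```

Re-running the clustering for increasing `sigma` or `fraction` and tracking how much the partition changes gives a simple robustness curve.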
  • asked a question related to Clustering Algorithms
Question
3 answers
I usually use Latent Dirichlet Allocation to cluster texts. What do you use? Can someone give a comparison between different text clustering algorithms?
Relevant answer
Answer
I typically have used k-means clustering algorithm which is very popular. This algorithm is based on partitioning. Similarly you can use clustering algorithms based on density or hierarchical clustering methods.
  • asked a question related to Clustering Algorithms
Question
7 answers
There is an idea to design a new algorithm for the purpose of improving the results of software operations in the fields of communications, computers, biomedical, machine learning, renewable energy, signal and image processing, and others.
So what are the most important ways to test the performance of smart optimization algorithms in general?
Relevant answer
Answer
I'm not keen on calling anything "smart". Any method will fail under some circumstances, such as for some outlier that no one has thought of.
  • asked a question related to Clustering Algorithms
Question
10 answers
Hi
We have two time series datasets, for example mean daily temperature from two stations. Each of them has 30 values for a month.
T1={ t1 , t2 ,... , t30 }
T2={ t1 , t2 ,... , t30 }
Now we want to calculate the similarity between them. Some people may suggest the correlation coefficient for this task, but I think we use it when we consider the relation between datasets, not their similarity.
Is there any index for measuring similarity?
Thanks
Relevant answer
Answer
Cross recurrence and joint recurrence are the most general metrics for correlating time series. The first is phase dependent, the other is not.
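To make the distinction raised in the question concrete, the sketch below compares three simple measures on two short series: Pearson correlation, Euclidean distance, and dynamic time warping (DTW). These are offered as common alternatives, not as the recurrence measures named in the answer. The function names and the toy temperatures are assumptions.

```python
import math

def euclidean(a, b):
    """Point-by-point distance between two equally long series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Correlation coefficient: sensitive to shape, blind to offset and scale."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def dtw(a, b):
    """Dynamic time warping distance with absolute difference as local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[-1][-1]

# Two stations with the same trend but a constant 10-degree offset.
t1 = [1.0, 2.0, 3.0, 4.0]
t2 = [11.0, 12.0, 13.0, 14.0]
print(pearson(t1, t2), euclidean(t1, t2), dtw(t1, t2))
```

The two toy series are perfectly correlated yet far apart in value, which supports the point made in the question: correlation captures co-variation, while a distance such as Euclidean or DTW captures closeness of the values themselves.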
  • asked a question related to Clustering Algorithms
Question
8 answers
dear researchers
I would be appreciative if you let me know your opinion about the disadvantages of the SOM clustering algorithm.
Relevant answer
Answer
The number of parameters, the values for the training parameters, the size and topology of the map, all have to be determined in advance. However, there are several settings to bring good results but they are time-consuming.
  • asked a question related to Clustering Algorithms
Question
3 answers
I want a matlab code for distributed energy-efficient clustering Protocol (DEEC) in wireless sensor network (WSN) with explanation please
thanks
Relevant answer
Answer
I do have that code in Python.
If you're interested, please let me know.
Regards.
  • asked a question related to Clustering Algorithms
Question
5 answers
In this equation how do I choose p (prob of cluster head)? I need to know what value I have to choose.
Relevant answer
Answer
This is for homogeneous environment, what will be for heterogeneous environment?
  • asked a question related to Clustering Algorithms
Question
12 answers
Hi
I am trying to segment a sentinel2 image.
At this stage, I want to run a binary classifier that assigns each pixel to either farm or non-farm pixel. For this purpose, I have 4 10m bands including R/G/B/NIR. I also have generated an NDVI raster for each month (8 months in total) that has values ranging from -1 to 1 (it can be normalized to 0 to 255).
I am looking for a classifier that can accurately classify the pixels using NDVI and/or any combination of my 4 10m bands.
Thanks in advance.
Relevant answer
Answer
Convolutional Neural Networks (CNNs) are the most popular neural network models used for image classification problems. The big idea behind CNNs is that a local understanding of an image is good enough.
Top 5 Classification Algorithms in Machine Learning
  • Logistic Regression.
  • Naive Bayes Classifier.
  • K-Nearest Neighbors.
  • Decision Tree. Random Forest.
  • Support Vector Machines.
  • asked a question related to Clustering Algorithms
Question
6 answers
I am conducting exploratory research about users on the Ethereum blockchain (I obtain the data from BigQuery), and I would like to cluster the users, mostly by transactional features, for persona/archetype development.
However, the data is not normally distributed, many of the variables have a power-law distribution and some have no clear distribution pattern. It is very likely that I would like to include more than five variables.
Besides the question of what algorithm fits best, is it reasonable to normalize all variables (to a more normal distribution) and to perform a z-transformation?
Relevant answer
Answer
You may try this algorithm:
B. K. Tripathy and D. Mittal: Hadoop based uncertain possibilistic kernelized c-means algorithms for image segmentation and a comparative analysis, Applied soft computing, 46, (2016), pp.886-923.
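On the preprocessing part of the question, a common treatment for heavy-tailed, non-negative variables is a log transform followed by a z-transformation, so that a few very large values do not dominate distance-based clustering. The sketch below assumes non-negative inputs such as transaction counts or volumes. The function name `log_zscore` and the use of `log1p` (to cope with zeros) are illustrative choices, not a recommendation from the thread.

```python
import math

def log_zscore(values):
    """log1p-transform non-negative values, then standardise to mean 0, variance 1."""
    logged = [math.log1p(v) for v in values]
    n = len(logged)
    mu = sum(logged) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in logged) / n)
    return [(x - mu) / sd for x in logged]

# Hypothetical transaction counts spanning several orders of magnitude.
print(log_zscore([0, 1, 10, 100, 10000]))
```

The transform is monotone, so it preserves the ordering of users on each variable while compressing the long tail. Whether that compression is desirable depends on whether the extreme users are the personas of interest.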
  • asked a question related to Clustering Algorithms
Question
21 answers
Hi,
Whenever the scikit-learn function sklearn.model_selection.train_test_split is used, it is recommended to set the parameter random_state=42 to produce the same results across different runs.
Why do we use the integer 42?
Can we use another number?
thanks
Relevant answer
Answer
Yes, you can use a different number. It will produce a different split than 42 does, which can be used to evaluate your experiment under distinct scenarios.
Also, you are probably using 42 because of this (from Wikipedia): the number 42 is, in The Hitchhiker's Guide to the Galaxy by Douglas Adams, the "Answer to the Ultimate Question of Life, the Universe, and Everything" =)
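The role of the seed can be shown without scikit-learn. The hypothetical `split` helper below is a stand-in for train_test_split, not its actual implementation. It shuffles with a seeded generator, so the same seed always reproduces the same partition, and nothing about the value 42 is special.

```python
import random

def split(items, test_fraction, seed):
    """Shuffle with a seeded generator, then cut off the last part as the test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]

train_a, test_a = split(range(10), 0.3, seed=42)
train_b, test_b = split(range(10), 0.3, seed=42)
train_c, test_c = split(range(10), 0.3, seed=7)

print(test_a == test_b)  # same seed, same split
print(test_a, test_c)    # a different seed will usually give a different split
```

Repeating an experiment over several seeds and reporting the spread of the results is a simple guard against conclusions that depend on one lucky split.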
  • asked a question related to Clustering Algorithms
Question
11 answers
I know there are plenty of evaluation methods for assessing the clustering result on a single data set. I am trying to apply the same clustering technique to two different data sets and then compare the similarity of the resulting clusters.
For example, I want to compare the results of the same clustering algorithm on two consecutive time intervals with different data.
Relevant answer
Answer
Sara Mirzaie: I suggested one way in my second answer above. I'd be happy to discuss further via email or a direct message here :-)
  • asked a question related to Clustering Algorithms
Question
4 answers
To evaluate the performance of evolutionary clustering algorithms, there are internal and external evaluation metrics. Beyond these, are there any statistical measures for comparing clustering algorithms in terms of their best, average, and worst performance on a particular dataset?
Relevant answer
Answer
A performance/accuracy comparison should run the different methods on a similar task over similar data.
  • asked a question related to Clustering Algorithms
Question
3 answers
Can heuristic or meta-heuristic fuzzy clustering algorithms help me? Any general suggestions? I want to create learner profiles based on computational intelligence methods. The number of groups (profiles) is unknown.
  • asked a question related to Clustering Algorithms
Question
1 answer
Does anyone know of a WORKING implementation of the original DENCLUE density-based clustering algorithm, and NOT its extensions? I have explored a few on GitHub, but they do not work. Moreover, if anyone is interested in collaborating in this area, they are welcome.
Regards
  • asked a question related to Clustering Algorithms
Question
4 answers
Apart from this, please suggest a few benchmark datasets for high utility mining.
Relevant answer
Answer
The following are the top search engines; they need proper keywords (keep changing them until you reach suitable ones):
There are also many more, such as https://data.mendeley.com/
  • asked a question related to Clustering Algorithms
Question
10 answers
Think of a scenario where there is a dataset with 15 attributes (i.e. column headings). I want to apply a clustering algorithm to that dataset, not using all 15 attributes but any combination of them (that is what I mean by dynamic attribute clustering). How can I build a model that can determine the optimal number of clusters and do the clustering for any combination of the attributes provided?
Relevant answer
Answer
Please find the answer below.
1- Your question:
How can I build any model which could be able to determine the optimal number of clusters and do clustering for any combination of the attributes provided?
Answer
Technically, you need to build a model that works on the selected feature subset. To do so, a search method like "BestFirst" must be applied to find the feature subset. Then you need to apply an evaluator to assess the merit of the selected subset. Finally, invoke a clustering algorithm like EM (expectation maximisation) to find the number of clusters. (Note that finding the number of clusters is not a trivial task by default, but EM does this job using internal cross-validation.)
2- Your question:
actually the user will pick his preferred attributes and I want to apply clustering for all the possible preferences a user can have.
Answer
If a user selects the specific features, there is no need for a search method; just apply EM directly to the selected features.
HTH.
Dr. Samer Sarsam
  • asked a question related to Clustering Algorithms
Question
3 answers
I found numerous internal evaluation indexes applied to DBSCAN. Some researchers say that there is no internal evaluation available for DBSCAN, while some papers use indexes such as DBI, Dunn, or S_Dbw.
Can I use the classical indexes for the internal evaluation of DBSCAN? Or are there limitations or challenges in applying those metrics?
Relevant answer
Answer
In addition to the well-written paper that Sakib Shahriar suggested, you could also have a look at this one:
  • asked a question related to Clustering Algorithms
Question
6 answers
I have already implemented a clustering algorithm for WSN (IoT) based on the remaining energies of the CHs. But I need to improve the algorithm for my academic project. Please suggest possible improvements.
Thanks
Relevant answer
Answer
  • asked a question related to Clustering Algorithms
Question
3 answers
Normalized Mutual Information (NMI) and B3 are used for extrinsic clustering evaluation metrics when each instance (sample) has only one label.
What are the equivalent metrics when each instance (sample) has multiple labels?
For example, in the first image we see [apple, orange, pears], in the second image [orange, lime, lemon], in the third image [apple], and in the fourth image [orange]. Putting the first and last images in one cluster is then good, while putting the third and fourth images in one cluster is bad.
Application: Many popular datasets for object detection or image segmentation have multiple labels for each image. If we use this data for classification (not detection and not segmentation), we have multiple labels for each image.
Note: My task is unsupervised clustering, not supervised classification. I know that for supervised classification, we can use top-5 or top-10 score. But I do not know what will be in unsupervised clustering.
Relevant answer
Answer
  • asked a question related to Clustering Algorithms
Question
6 answers
Do we need to do feature scaling on each dimension of a multi-dimensional data distribution before applying the K-Means clustering algorithm for it to be effective?
Relevant answer
Answer
Yes, to make sure that your calculations are not biased toward either the very high or the very low values. In other words, to make sure that all your features are on the same scale. You could use any normalization technique to do this; I recommend this one:
Xi(new) = (Xi - mean(all X values)) / standard deviation(all X values)
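The formula above is applied per feature (column). A minimal standard-library sketch follows; the `zscore_columns` helper and the age/income rows are hypothetical. It also compares a Euclidean distance before and after scaling, since that distance is what K-means works with.

```python
import math
import statistics

def zscore_columns(rows):
    """Standardize each column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical rows of (age in years, income in dollars).
data = [[25, 30000.0], [35, 52000.0], [45, 61000.0], [55, 98000.0]]
scaled = zscore_columns(data)

raw_dist = math.dist(data[0], data[1])         # dominated by the income column
scaled_dist = math.dist(scaled[0], scaled[1])  # both columns now contribute
```

Without scaling, the income column alone decides which points count as "close", simply because its numeric range is larger.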
  • asked a question related to Clustering Algorithms
Question
4 answers
Before starting on the K-Means clustering algorithm, is it advisable to convert the data to have zero mean and unit covariance?
Relevant answer
Answer
For K-means, data normalization is not always required, but it often improves clustering and rarely hurts. The reason: K-means clustering is "isotropic" in all directions of space and therefore tends to produce more or less round (rather than elongated) clusters. See also: https://www.researchgate.net/post/Does_normalization_of_data_always_improve_the_clustering_results
  • asked a question related to Clustering Algorithms
Question
3 answers
I would like to know if there is any research on parameter selection (minpoints) for the HDBSCAN algorithm in the setting where the training set contains only normal samples and the test set contains anomalies/outliers.
I want to apply the GLOSH algorithm to find the outliers, but since I have a "clean" dataset available, I suppose it is possible to exploit it.
I know that many other algorithms exist for novelty detection (OC-SVM, etc.), but I would like to try solutions with HDBSCAN.
Relevant answer
Answer
Hello Davide
Are you interested in feature selection or hyperparameter optimization?
regards
  • asked a question related to Clustering Algorithms
Question
5 answers
I would like to simulate bio-inspired, optimization-based geocast routing for VANETs using a clustering algorithm in ns-3 or OMNeT++ with SUMO, and evaluate my approach against other existing approaches.
I need the source code of any geocast-based routing protocol for VANETs (such as Gytar or Rover) or any other position-based routing protocol (GPSR, GPCR) that uses a clustering algorithm. Furthermore, I need to integrate an optimization algorithm such as ANT into the routing protocol to handle optimal path selection.
I look forward to hearing from you.
Thanks in advance.
  • asked a question related to Clustering Algorithms
Question
18 answers
Dear expert,
I am working on machine learning clustering algorithms for IoT sensor data fault detection and correction, considering non-spherical and unbalanced data in an incremental way. Can you please suggest the best algorithm in this regard, or the best research article on sensor data fault detection and prevention?
Relevant answer
Answer
Dear
I suggest k-means.
  • asked a question related to Clustering Algorithms
Question
3 answers
I am using Hierarchical Agglomerative Clustering (HAC) and DBSCAN to find clusters in my data.
Please suggest some validation techniques for validating the clustering results.
I have used the Silhouette Score to validate the results of DBSCAN. Will it work well with both clustering algorithms? I am using Python for the implementation.
Relevant answer
Answer
Use cross-validation.
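On the Silhouette part of the question: the score can be computed for the output of both algorithms, with one caveat for DBSCAN. scikit-learn labels noise points as -1, and they would otherwise be scored as if they formed a cluster. The sketch below uses synthetic blobs, and the `eps` and `min_samples` values are assumptions chosen for this toy data only. Silhouette also tends to favor compact, convex clusters, so a low score for an arbitrarily shaped DBSCAN cluster is not necessarily a bad result.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (toy data, not the asker's dataset).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

# HAC assigns every point to a cluster, so the score can be computed directly.
hac_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
hac_score = silhouette_score(X, hac_labels)

# DBSCAN marks noise with label -1; drop those points before scoring.
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
assigned = db_labels != -1
db_score = silhouette_score(X[assigned], db_labels[assigned])

print(round(hac_score, 3), round(db_score, 3))
```

When comparing the two algorithms this way, it is worth reporting the fraction of points DBSCAN marked as noise alongside its score, since the score only covers the points that were kept.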
  • asked a question related to Clustering Algorithms
Question
11 answers
I don't want a recommendation system.
I want to group similar users who select similar items.
I then want to cluster them and label those clusters.
Jaccard similarity is not efficient because I have millions of users and items.
Matrix factorization gives a recommendation model, but I don't want to recommend any items. I just want to group the users and label the groups.
I don't know if k-means clustering works well for multiple feature vectors.
Does anybody know how to implement this?
Relevant answer
Answer
I would recommend to also look into Boolean matrix factorization (BMF). k-means/modes is only usable if clusters have convex shapes and every point belongs to exactly one cluster. BMF can compute clusters with overlap (might not be necessarily important for your application) but it also identifies the features which are important for creating one cluster together with the data points belonging to one cluster. Especially if you have a high-dimensional feature space, it is unlikely that the similarity between points is expressed over the whole feature space. BMF identifies the feature space in which the points form a cluster.
If you are interested, look at Chapter 2 (particularly Section 2.4) from
  • asked a question related to Clustering Algorithms
Question
4 answers
Hi, I have a large GPS dataset covering many users. My goal is to find the key locations where each user spent most of their time. Which clustering algorithm would be best for this?
Relevant answer
  • asked a question related to Clustering Algorithms
Question
4 answers
How can ensemble clustering be applied when the clustering algorithms produce different numbers of clusters?
Relevant answer
Answer
Try to figure out whether there is any method that can find a correlation or some other relationship among the heterogeneous outputs of the different clustering algorithms; then you can use the ensemble approach.
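One such relationship that does not depend on the number of clusters is co-association: for every pair of points, count the fraction of runs that placed them in the same cluster. This is a named alternative (often called evidence accumulation), not something taken from the answer above. In the sketch below, the `co_association` and `consensus` helpers, the 0.5 threshold, and the three toy labelings are all assumptions for illustration.

```python
def co_association(labelings):
    """Fraction of runs in which each pair of points shares a cluster."""
    n = len(labelings[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0 / len(labelings)
    return matrix

def consensus(matrix, threshold=0.5):
    """Link pairs co-clustered in more than `threshold` of the runs and
    return the connected components as consensus labels."""
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Three hypothetical runs over the same six points, with k = 2, 3 and 3.
runs = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 2, 2, 2],
    [0, 0, 0, 1, 1, 2],
]
matrix = co_association(runs)
labels = consensus(matrix)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because only pairwise agreement is used, the cluster IDs and the number of clusters in each run never need to be matched against each other.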
  • asked a question related to Clustering Algorithms
Question
6 answers
I have implemented k-means to find the cluster centroids; now I want to find the "KPI" of the algorithm.
Relevant answer
Answer
You can evaluate the performance of k-means by its convergence rate and by the sum of squared errors (SSE), comparing SSE values across runs.
read about it more at url
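For reference, SSE is the sum of squared Euclidean distances from each point to the centroid of its assigned cluster. A minimal sketch with made-up points and centroids follows; the `sse` helper is hypothetical. In scikit-learn, the same quantity is exposed after fitting as `KMeans.inertia_`.

```python
def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total

# Toy data: two clusters of two points each.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
labels = [0, 0, 1, 1]
centroids = [(1.0, 2.0), (10.0, 9.0)]

print(sse(points, labels, centroids))  # 4.0
```

SSE always decreases as k grows, so it is useful for comparing runs with the same k (for example different initializations), not for choosing k by itself.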
  • asked a question related to Clustering Algorithms
Question
8 answers
Hello,
I am a novice researcher in the field of clustering algorithms. DBSCAN is the one whose mathematical background drew my attention, but unfortunately I could not think of any use cases for the algorithm outside biological and medical fields. I would appreciate it if anyone could point me to some useful resources in which DBSCAN is used outside those fields. Given my bachelor's degree (Industrial Engineering), I'm particularly looking for industrial and manufacturing examples.
Regards.
Relevant answer
Answer
Let's think of a practical use of DBSCAN. Suppose we have an e-commerce site and we want to improve our sales by recommending relevant products to our customers. We don't know exactly what our customers are looking for, but based on a data set we can predict and recommend a relevant product to a specific customer. We can apply DBSCAN to our data set (based on the e-commerce database) and find clusters based on the products that users have bought. Using these clusters we can find similarities between customers: for example, if customer A has bought 1 pen, 1 book and 1 pair of scissors and customer B has bought 1 book and 1 pair of scissors, then we can recommend a pen to customer B. This is just a small example of using DBSCAN, but it can be used in many applications in several areas.
  • asked a question related to Clustering Algorithms
Question
1 answer
I want to ask if anyone here has an idea about stability approaches: how can I compare my newly developed stream clustering algorithm with, for instance, the stable approach proposed by Carlsson and Mémoli, in light of such invariant-order stability:
Thank you in advance, Dr. Hany
Rowanda Ahmed
Relevant answer
Answer
  • asked a question related to Clustering Algorithms
Question
16 answers
Hello everyone,
Currently I am trying to do k-means clustering on a microarray dataset that consists of 127 columns and 1000 rows. When I plot the graph, it gives an error like "figure margins too large". Then I wrote this in the R console:
par("mar")  # shows the current margin sizes
par(mar = c(1, 1, 1, 1))  # tried to shrink the margins
But it did not work. Can anyone suggest another way of fixing this problem? (Part of the code is attached below.)
Thanks,
Hasan
--------------------------------------------------------------------------------------------------------------
x = as.data.frame(x)
# Run k-means with K = 2 and 20 random restarts
km_out = kmeans(x, 2, nstart = 20)
km_out$cluster
# Color each point by its assigned cluster
plot(x, col=(km_out$cluster+1), main="K - Means Clustering Results with K=2", xlab="", ylab="", pch=20, cex=2)
>Error in plot.new() : figure margins too large
Relevant answer
Answer
  • asked a question related to Clustering Algorithms
Question
6 answers
Hi,
I want to reduce the number of rows in my data set. Until now I have used some clustering algorithms (k-means, k-medoids, SOM), but recently I discovered some papers:
  • lp row sampling with Lewis weights ( Cohen, Peng),
  • Iterative row sampling (Li, Miller, Peng),
  • Compressive sampling (Candes).
I would like to know which method is best, taking into account the density of the variables in the data set.
Is my question meaningful? Or does it make no sense?
I mean, I want a true representation of my data set.
Thanks,
Robin
Relevant answer
Answer
Possibly (random) graphs are what would help you.
  • asked a question related to Clustering Algorithms
Question
9 answers
Hello everybody,
Currently, I am working on a text clustering algorithm that combines the potential of Self-Organizing Maps (SOM) and K-means to create valid clusters and to show their neighborhood. So far it works well, but I would like to see other potential approaches to compare results.
The idea is to improve the selection of documentation while preparing for research in a specific area: for example, when you have a huge databank of papers related to a specific subject and you want to see the relationships between the publications and select just the area of interest.
Attached is an image of the SOM after organizing 1061 publication titles.
Do any of you have experience in the text clustering area?
Thank you in advance.
Relevant answer
Answer
Fuzzy clustering is better than k-means clustering because it gives more appropriate results. You can go through one of my papers, in which it is applied to text data.
  • asked a question related to Clustering Algorithms
Question
10 answers
Hello, I'm a biologist interested in machine learning application in genomic data; specifically, I'm trying to apply clustering techniques to differential gene expression data.
I started by learning the basics of unsupervised learning and clustering algorithms with random datasets, but now I need to apply some of those algorithms (k-means, PAM, CLARA, SOM, DBSCAN...) to differential gene expression data and, honestly, I don't know where to begin. I'd be grateful if someone could recommend some tutorials or textbooks, or give me some tips.
Thank you for your time!
PS: I'm mainly using R, but Python tutorials are also OK for me.
Relevant answer
Regards,
Antonio
  • asked a question related to Clustering Algorithms
Question
3 answers
The semi-supervised clustering framework MPCK-means integrates both constraint-based and metric learning approaches (Bilenko et al., 2004). Since then, several semi-supervised clustering algorithms have been proposed that incorporate pairwise constraints for an improved clustering solution. All these methods take a long time to cluster large data sets (a few thousand data objects) when incorporating pairwise constraints; this incorporation is the main contributor to the running time.
Relevant answer
Answer
Hi, you can use Approximate Computing (AC) to further reduce the running time of such algorithms on big data, saving computational time and energy in the process. You can check out the different tools and technologies available for AC.
I have a survey paper on the same topic. You can have a look at it as well.
  • asked a question related to Clustering Algorithms
Question
5 answers
Is there any paper that introduces an intuitive method for clustering evaluation?
I would like to use the most intuitive criterion, i.e. minimizing the within-cluster distance and maximizing the distance between neighboring clusters, but I am not sure whether this method has a name. Is there any related paper on it? The only method I am using now is the Silhouette score. Many thanks.
Relevant answer
Answer
INDIVIDUAL DIFFERENCES.
  • asked a question related to Clustering Algorithms
Question
8 answers
Dear,
I am working on improving cluster efficiency by reducing the number of clusters and centers. Can you please suggest the best technique/algorithm for large, mixed-type IoT data?
Relevant answer
Answer
Thank you Basim Mahmood
  • asked a question related to Clustering Algorithms
Question
3 answers
I really do not understand the concept of a micro-cluster. I know it is a temporal extension of the Cluster Feature used in the BIRCH algorithm, but I am still confused about whether it is a data structure or something else.
Relevant answer
Answer
A micro-cluster is a temporal extension of the cluster feature, which compresses the data effectively. It defines the clusters by separating dense areas from sparse ones. In clustering data streams, it is impractical to store all the incoming data objects. Micro-clusters are a popular technique in stream clustering for maintaining a compact representation of the clustering.
The following link may be useful for you.
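To address the "is it a data structure?" part directly: in practice a micro-cluster is usually implemented as a small record of running sums, which is why it can summarize a stream without keeping the points. The sketch below is one possible layout, assuming the usual cluster-feature sums (count, linear sum, squared sum) extended with timestamp sums. The `MicroCluster` class and its field names are hypothetical, not taken from a specific library.

```python
import math

class MicroCluster:
    """Additive summary of the absorbed points (no raw points are stored)."""

    def __init__(self, dim):
        self.n = 0             # number of absorbed points
        self.ls = [0.0] * dim  # per-dimension linear sum
        self.ss = [0.0] * dim  # per-dimension sum of squares
        self.ls_t = 0.0        # sum of timestamps (the temporal extension)
        self.ss_t = 0.0        # sum of squared timestamps

    def insert(self, point, timestamp):
        """Absorb one point by updating the running sums."""
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x
        self.ls_t += timestamp
        self.ss_t += timestamp * timestamp

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Root of the summed per-dimension variances, from the sums alone.
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))

mc = MicroCluster(dim=2)
mc.insert((1.0, 2.0), timestamp=1.0)
mc.insert((3.0, 4.0), timestamp=2.0)
print(mc.centroid())  # [2.0, 3.0]
```

Because every field is a sum, two micro-clusters can be merged by adding their fields, and the timestamp sums let an algorithm judge how recent a micro-cluster is.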
  • asked a question related to Clustering Algorithms
Question
18 answers
For everyone who wants the code of the ISODATA algorithm: it has been implemented and is now available to help you and save you time. Just send me a message and it will be sent to you.
Relevant answer
Answer
sunupo@126.com. Thanks for your kindness.
  • asked a question related to Clustering Algorithms
Question
4 answers
I need a hierarchical clustering algorithm with the single linkage method. Whatever I find in my searches is code that uses scikit-learn, but I don't want that! I want code that shows every detail of the algorithm.
Relevant answer
Answer
Nguyen Van Thieu, thank you so much! Your answer is really useful.
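For readers looking for the same thing, here is a minimal from-scratch sketch of agglomerative clustering with single linkage, using only the Python standard library. It is a naive version written for readability, not the code referred to in this thread. The `single_linkage` function and the toy points are hypothetical, and the repeated pairwise search makes it suitable only for small data sets.

```python
import math

def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    The distance between two clusters is the smallest distance between
    any point of one and any point of the other (single linkage).
    """
    clusters = [[i] for i in range(len(points))]  # start with singletons
    merges = []  # history of (cluster_a, cluster_b, distance)

    def linkage(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = single_linkage(points, n_clusters=2)
print(clusters)  # [[4], [0, 1, 2, 3]]
```

The `merges` list records each merge and its distance, which is the information a dendrogram is drawn from.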
  • asked a question related to Clustering Algorithms
Question
5 answers
Hello, I have a clustering project using K-Means.
I have a question about how to name the clusters in K-Means, because so far scientists just label the clusters as, e.g., cluster_1, cluster_2, cluster_3, ..., cluster_n.
How can the clusters be given other names, e.g., cluster_germany, cluster_us, cluster_british, etc.?
Relevant answer
Answer
The centroid of each labeled group of data corresponds to the cluster prototype.
  • asked a question related to Clustering Algorithms