Conference Paper

A comparative study of clustering methods using word embeddings

... For efficient data processing, several methods have been developed and implemented, such as LDA (Jelodar et al. 2019; Xue et al. 2020; Ali et al. 2019a; Ali et al. 2020a), LSA (Gefen et al. 2017; Ramya and Magesh 2019), and Word2Vec (Yuan et al. 2020; Yilmaz and Toklu 2020; Ali et al. 2019b; Ali et al. 2020b). These methods facilitate data processing and provide an appropriate representation of data features (Bastas et al. 2019). In the proposed study, LSA, Word2Vec, and LDA are applied to represent data features. ...
... To evaluate the performance of topic modeling, several clustering methods are available, such as fuzzy clustering (Ahanin and Ismail 2020), hierarchical clustering (Ding et al. 2018), density-based clustering (Jang et al. 2016), and K-means clustering (Santoso et al. 2020). In the proposed study, the K-means clustering algorithm is applied to evaluate topic modeling, as it is one of the most widely utilized algorithms in the literature (Bastas et al. 2019; Aghamohseni and Ramezanian 2015). ...
... This section provides a brief overview of documents that use K-means clustering for topic modeling with LDA, LSA, or Word2Vec. Bastas et al. (2019) compared the performance of five clustering approaches, including k-means, spherical k-means, probabilistic fuzzy C-means, agglomerative clustering, and non-negative matrix factorization, on features derived from the implementation of Paragraph-Vector and LDA models. They used two terrorism-related data-sets and one benchmark data-set. ...
Article
Full-text available
Topic detection from Twitter is a significant task that provides insight into real-time information. Recently, word embedding methods and topic modeling techniques have been utilized to find latent topics in various fields. Detecting topics leads to effective semantic structure and provides a better understanding of users. In the proposed study, different types of topic detection techniques are utilized, namely latent semantic analysis (LSA), Word2Vec, and latent Dirichlet allocation (LDA), and their performances are evaluated by applying the K-means clustering technique in a real-life application. In this case study, tweets were gathered after an earthquake with a magnitude of 6.6 on the Richter scale that took place on October 30, 2020, on the coast of the Aegean Sea (İzmir), Turkey. Tweets are clustered under fifteen hashtags separately, and the aforementioned techniques are applied to data-sets which vary in size. The novelty of the proposed paper can therefore be expressed as the comparison of different topic models and word embedding methods implemented for different sizes of documents in order to demonstrate the performance of these methods. While Word2Vec gives good results on small data-sets, LDA generally gives better results than Word2Vec and LSA on medium and large data-sets. Another aim of the proposed study is to provide information to decision-makers for supporting victims and society. Therefore, the general situation of society is analyzed and society's attitude is demonstrated so that decision-makers can take concrete actions such as psychological support, educational support, financial support, and political measures.
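As a rough illustration of the pipeline this abstract describes, the sketch below builds LSA, LDA, and averaged-Word2Vec features for a toy corpus and clusters each with K-means. It is not the authors' code; the tweets, the number of clusters, and all hyperparameters are placeholder values, and scikit-learn plus gensim (>= 4.0) are assumed to be available.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.cluster import KMeans
from gensim.models import Word2Vec

tweets = [
    "earthquake hit izmir tonight",
    "rescue teams searching collapsed buildings",
    "donations needed for earthquake victims",
    "psychological support for children after the quake",
]
tokens = [t.split() for t in tweets]
k = 2  # number of clusters, chosen arbitrarily for the toy data

# LSA: tf-idf followed by truncated SVD
tfidf = TfidfVectorizer().fit_transform(tweets)
lsa_features = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# LDA: document-topic proportions as features
counts = CountVectorizer().fit_transform(tweets)
lda_features = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# Word2Vec: average the word vectors of each tweet
w2v = Word2Vec(sentences=tokens, vector_size=50, min_count=1, seed=0)
w2v_features = np.array([np.mean([w2v.wv[w] for w in toks], axis=0) for toks in tokens])

for name, feats in [("LSA", lsa_features), ("LDA", lda_features), ("Word2Vec", w2v_features)]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
    print(name, labels)
```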
... A major chunk of this data exists in the form of sequential textual data. Computer systems are well equipped to handle numerical data and perform well with numerical databases, but this new form of data being generated necessitates the development of specialized algorithms that convert this textual data into a form that can be understood by a machine (Bastas et al., 2019). ...
Conference Paper
It is a clearly established fact that good categorization results are heavily dependent on representation techniques. Text representation is a necessity that must be fulfilled before working on any text analysis task, since it creates a baseline which even advanced machine learning models fail to compensate for. This paper aims to comprehensively analyze and quantitatively evaluate the various models to represent text in order to perform Subjectivity Analysis. We implement a diverse array of models on the Cornell Subjectivity Dataset. It is worth noting that the BERT Language Model gives much better results than any other model but is significantly more computationally expensive than the other approaches. We obtained state-of-the-art results on the subjectivity task by fine-tuning the BERT Language Model. This can open up a lot of new avenues and potentially lead to a specialized model inspired by BERT dedicated to subjectivity analysis.
Article
Full-text available
Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over data sets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf weights combined with embedding distance measures.
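The benchmarking protocol described in this abstract can be approximated with off-the-shelf scikit-learn components. The snippet below is a schematic version only: the documents, labels, and the particular representations and clusterers are illustrative stand-ins, and each representation/clusterer pair is scored with an adjusted-for-chance extrinsic measure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_mutual_info_score

docs = ["cats purr", "dogs bark", "kittens purr softly", "puppies bark loudly"]
true_labels = [0, 1, 0, 1]  # placeholder ground truth

tfidf = TfidfVectorizer().fit_transform(docs).toarray()
representations = {
    "tfidf": tfidf,
    "lsa": TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf),
}
clusterers = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=2),
}

# Cross every representation with every clustering method and score the result.
for rname, X in representations.items():
    for cname, clusterer in clusterers.items():
        pred = clusterer.fit_predict(X)
        print(rname, cname, adjusted_mutual_info_score(true_labels, pred))
```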
Article
Full-text available
Adjusted for chance measures are widely used to compare partitions/clusterings of the same data set. In particular, the Adjusted Rand Index (ARI) based on pair-counting, and the Adjusted Mutual Information (AMI) based on Shannon information theory are very popular in the clustering community. Nonetheless, it remains an open problem which application scenarios best suit each measure, and guidelines in the literature for their usage are sparse, with the result that users often resort to using both. Generalized Information Theoretic (IT) measures based on the Tsallis entropy have been shown to link pair-counting and Shannon IT measures. In this paper, we aim to bridge the gap between adjustment of measures based on pair-counting and measures based on information theory. We solve the key technical challenge of analytically computing the expected value and variance of generalized IT measures. This allows us to propose adjustments of generalized IT measures, which reduce to well known adjusted clustering comparison measures as special cases. Using the theory of generalized IT measures, we are able to propose the following guidelines for using ARI and AMI as external validation indices: ARI should be used when the reference clustering has large equal sized clusters; AMI should be used when the reference clustering is unbalanced and there exist small clusters.
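Both indices discussed above are implemented in scikit-learn. The toy comparison below only shows how they are called on made-up labelings, not the paper's experiments; the guideline from the abstract is to prefer ARI for large, balanced reference clusters and AMI for unbalanced references with small clusters.

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

reference = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # ground-truth partition (toy)
candidate = [0, 0, 1, 1, 1, 1, 2, 2, 2]  # clustering to be evaluated (toy)

print("ARI:", adjusted_rand_score(reference, candidate))
print("AMI:", adjusted_mutual_info_score(reference, candidate))
```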
Conference Paper
Full-text available
Twitter, or the world of 140 characters, poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, they are often less coherent when applied to microblog content like Twitter. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a variety of pooling schemes. An additional contribution of automatic hashtag labeling further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.
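A minimal sketch of the hashtag-pooling preprocessing step (not the authors' implementation): tweets sharing a hashtag are concatenated into one pseudo-document before LDA, which gives the model longer, less sparse documents. The tweets, regex, and model settings below are assumed toy values; scikit-learn is required.

```python
import re
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "stay safe everyone #earthquake",
    "buildings collapsed in izmir #earthquake",
    "how to donate #help",
    "volunteers needed this weekend #help",
]

# Pool tweets by hashtag; tweets without a hashtag fall into a shared "#none" bucket.
pools = defaultdict(list)
for tweet in tweets:
    for tag in re.findall(r"#\w+", tweet) or ["#none"]:
        pools[tag].append(tweet)
pseudo_docs = [" ".join(group) for group in pools.values()]

counts = CountVectorizer(stop_words="english").fit_transform(pseudo_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))  # topic mixture of each hashtag pool
```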
Article
Full-text available
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
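The two ideas in this abstract, skip-gram trained with negative sampling and data-driven phrase detection, are exposed by gensim. The example below assumes gensim >= 4.0, uses a tiny repeated corpus, and tunes the phrase threshold far down for that corpus, so several bigrams (not only "air canada") merge into single tokens; it is a sketch of the mechanism, not of the paper's setup.

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

sentences = [
    ["i", "flew", "air", "canada", "to", "toronto"],
    ["air", "canada", "delayed", "my", "flight"],
    ["canada", "is", "cold", "in", "winter"],
] * 10  # repeat so bigram statistics are stable on this toy corpus

# Phrase detection: frequent collocations become single tokens such as "air_canada".
bigrams = Phrases(sentences, min_count=2, threshold=0.3)
phrased = [bigrams[s] for s in sentences]
print(phrased[0])  # e.g. ['i_flew', 'air_canada', 'to_toronto']

# Skip-gram (sg=1) trained with negative sampling (negative=5).
model = Word2Vec(phrased, sg=1, negative=5, vector_size=50, window=3, min_count=1, seed=0)
print(model.wv.most_similar("air_canada", topn=3))
```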
Article
Full-text available
This paper transmits a FORTRAN-IV coding of the fuzzy c-means (FCM) clustering program. The FCM program is applicable to a wide variety of geostatistical data analysis problems. This program generates fuzzy partitions and prototypes for any set of numerical data. These partitions are useful for corroborating known substructures or suggesting substructure in unexplored data. The clustering criterion used to aggregate subsets is a generalized least-squares objective function. Features of this program include a choice of three norms (Euclidean, Diagonal, or Mahalanobis), an adjustable weighting factor that essentially controls sensitivity to noise, acceptance of variable numbers of clusters, and outputs that include several measures of cluster validity.
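Since the FORTRAN-IV listing itself is not reproduced here, the following NumPy sketch of the fuzzy c-means iteration (Euclidean norm only, fuzzifier m > 1, synthetic data) is offered as a simplified stand-in rather than a translation of the original program.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.random((c, len(X)))
    u /= u.sum(axis=0)                      # memberships of each point sum to 1
    for _ in range(n_iter):
        um = u ** m
        centers = um @ X / um.sum(axis=1, keepdims=True)      # weighted cluster prototypes
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        # Standard FCM membership update: u_ik proportional to d_ik^(-2/(m-1)).
        new_u = 1.0 / (d ** (2 / (m - 1)) * (1.0 / d ** (2 / (m - 1))).sum(axis=0))
        if np.abs(new_u - u).max() < eps:
            u = new_u
            break
        u = new_u
    return centers, u

# Two synthetic Gaussian blobs as a quick check.
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
               np.random.default_rng(2).normal(3, 0.3, (20, 2))])
centers, memberships = fuzzy_c_means(X, c=2)
print(centers)
```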
Article
Full-text available
In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word-word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations across a range of psychologically relevant semantic tasks. This article extends that study by investigating the use of three further factors--namely, the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD)--that have been used to provide improved performance elsewhere. It also introduces an additional semantic task and explores the advantages of using a much larger corpus. This leads to the discovery and analysis of improved SVD-based methods for generating semantic representations (that provide new state-of-the-art performance on a standard TOEFL task) and the identification and discussion of problems and misleading results that can arise without a full systematic study.
Conference Paper
Full-text available
A goal of statistical language modeling is to learn the joint probability function of sequences of words. This is intrinsically difficult because of the curse of dimensionality: we propose to fight it with its own weapons. In the proposed approach one learns simultaneously (1) a distributed representation for each word (i.e. a similarity between words) along with (2) the probability function for word sequences, expressed with these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar to words forming an already seen sentence. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach very significantly improves on a state-of-the-art trigram model.
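A minimal PyTorch sketch of this model family, a feed-forward neural probabilistic language model, is shown below: each word gets a learned embedding, and a small network maps the embeddings of the previous two words to a distribution over the next word. The toy corpus, context size, and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
context = 2  # use the two previous words to predict the next one

# Build (context, next-word) training pairs from the toy corpus.
X = torch.tensor([[idx[corpus[i]], idx[corpus[i + 1]]] for i in range(len(corpus) - 2)])
y = torch.tensor([idx[corpus[i + 2]] for i in range(len(corpus) - 2)])

class NPLM(nn.Module):
    def __init__(self, vocab_size, dim=16, hidden=32, context=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.net = nn.Sequential(nn.Linear(context * dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, vocab_size))
    def forward(self, ctx):
        # Concatenate the context embeddings and map them to next-word logits.
        return self.net(self.emb(ctx).flatten(start_dim=1))

model = NPLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```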
Article
Full-text available
Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. We perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic based measures, we advocate the normalized information distance (NID) as a general measure of choice, for it possesses concurrently several important properties, such as being both a metric and a normalized measure, admitting an exact analytical adjusted-for-chance form, and using the nominal [0,1] range better than other normalized variants.
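The NID advocated above can be assembled from standard pieces. The snippet below computes NID(U, V) = 1 - I(U; V) / max(H(U), H(V)) using scikit-learn's mutual information and a small entropy helper, on made-up labelings; it is an illustrative calculation, not the paper's code.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(labels):
    # Shannon entropy (in nats) of a labeling, from its empirical label frequencies.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

U = [0, 0, 0, 1, 1, 2, 2, 2]  # toy reference clustering
V = [0, 0, 1, 1, 1, 2, 2, 0]  # toy candidate clustering

nid = 1.0 - mutual_info_score(U, V) / max(entropy(U), entropy(V))
print("NID:", nid)
```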
Article
Full-text available
Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
Article
Full-text available
In 1997, we proposed the fuzzy-possibilistic c-means (FPCM) model and algorithm that generated both membership and typicality values when clustering unlabeled data. FPCM constrains the typicality values so that the sum over all data points of typicalities to a cluster is one. The row sum constraint produces unrealistic typicality values for large data sets. In this paper, we propose a new model called possibilistic-fuzzy c-means (PFCM) model. PFCM produces memberships and possibilities simultaneously, along with the usual point prototypes or cluster centers for each cluster. PFCM is a hybridization of possibilistic c-means (PCM) and fuzzy c-means (FCM) that often avoids various problems of PCM, FCM and FPCM. PFCM solves the noise sensitivity defect of FCM, overcomes the coincident clusters problem of PCM and eliminates the row sum constraints of FPCM. We derive the first-order necessary conditions for extrema of the PFCM objective function, and use them as the basis for a standard alternating optimization approach to finding local minima of the PFCM objective functional. Several numerical examples are given that compare FCM and PCM to PFCM. Our examples show that PFCM compares favorably to both of the previous models. Since PFCM prototypes are less sensitive to outliers and can avoid coincident clusters, PFCM is a strong candidate for fuzzy rule-based system identification.
Article
Full-text available
We propose a simple method to extract the community structure of large networks. Our method is a heuristic method that is based on modularity optimization. It is shown to outperform all other known community detection methods in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity. This is shown first by identifying language communities in a Belgian mobile phone network of 2 million customers and by analysing a web graph of 118 million nodes and more than one billion links. The accuracy of our algorithm is also verified on ad hoc modular networks.
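Recent NetworkX releases (>= 2.8) ship a Louvain implementation; the standalone python-louvain package is an alternative. The toy barbell graph below is only meant to show the call pattern and the modularity score the method optimizes, not the paper's experiments.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.barbell_graph(5, 0)  # two 5-cliques joined by a single edge
communities = louvain_communities(G, seed=0)
print(communities)
print("modularity:", modularity(G, communities))
```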
Article
This review discusses practical benefits and limitations of novel data-driven research for social scientists in general and criminologists in particular by providing a comprehensive examination of the matter. Specifically, this study is an attempt to critically evaluate ‘big data’, data-driven perspectives, and their epistemological value for both scholars and practitioners, particularly those working on crime. It serves as guidance for those who are interested in data-driven research by pointing out new research avenues. In addition to the benefits, the drawbacks associated with data-driven approaches are also discussed. Finally, critical problems that are emerging in this era, such as privacy and ethical concerns, are highlighted.
Article
Word embeddings resulting from neural language models have been shown to be a great asset for a large variety of NLP tasks. However, such architecture might be difficult and time-consuming to train. Instead, we propose to drastically simplify the word embeddings computation through a Hellinger PCA of the word co-occurrence matrix. We compare those new word embeddings with some well-known embeddings on named entity recognition and movie review tasks and show that we can reach similar or even better performance. Although deep learning is not really necessary for generating good word embeddings, we show that it can provide an easy way to adapt embeddings to specific tasks.
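A hedged sketch of the Hellinger PCA idea, on a toy corpus with assumed window and dimensionality settings: build a word-context co-occurrence matrix, normalize each row to a probability distribution, take element-wise square roots (so Euclidean distance between rows corresponds to Hellinger distance), and reduce with truncated SVD.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-1 word window.
cooc = np.zeros((len(vocab), len(vocab)))
window = 1
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - window), min(len(s), i + window + 1)):
            if i != j:
                cooc[idx[w], idx[s[j]]] += 1

rows = cooc / cooc.sum(axis=1, keepdims=True)   # P(context | word)
hellinger = np.sqrt(rows)                        # Hellinger transform
embeddings = TruncatedSVD(n_components=2, random_state=0).fit_transform(hellinger)
print(dict(zip(vocab, embeddings.round(3))))
```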
Article
Digital Design and Computer Architecture is designed for courses that combine digital logic design with computer organization/architecture or that teach these subjects as a two-course sequence. Digital Design and Computer Architecture begins with a modern approach by rigorously covering the fundamentals of digital logic design and then introducing Hardware Description Languages (HDLs). Featuring examples of the two most widely-used HDLs, VHDL and Verilog, the first half of the text prepares the reader for what follows in the second: the design of a MIPS Processor. By the end of Digital Design and Computer Architecture, readers will be able to build their own microprocessor and will have a top-to-bottom understanding of how it works--even if they have no formal background in design or architecture beyond an introductory class. David Harris and Sarah Harris combine an engaging and humorous writing style with an updated and hands-on approach to digital design. Unique presentation of digital logic design from the perspective of computer architecture using a real instruction set, MIPS. Side-by-side examples of the two most prominent Hardware Description Languages--VHDL and Verilog--illustrate and compare the ways each can be used in the design of digital systems. Worked examples conclude each section to enhance the reader's understanding and retention of the material. Companion Web site includes links to CAD tools for FPGA design from Synplicity and Xilinx, lecture slides, laboratory projects, and solutions to exercises.
Article
Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperforms bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
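Paragraph Vector is available in gensim as Doc2Vec. The minimal example below (gensim >= 4.0 assumed, toy documents and hyperparameters) shows the core idea: each document gets its own trainable vector, learned jointly with word vectors by predicting the document's words, and vectors can be inferred for unseen text.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    "the movie was powerful and moving",
    "a strong performance by the lead actor",
    "we visited paris in the spring",
    "the eiffel tower is in paris",
]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40, seed=0)
print(model.dv[0][:5])                                            # learned vector of document 0
print(model.infer_vector("powerful strong movie".split())[:5])    # vector for unseen text
```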
Article
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
Article
Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.
Article
Many intuitively appealing methods have been suggested for clustering data; however, interpretation of their results has been hindered by the lack of objective criteria. This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data. These criteria depend on a measure of similarity between two different clusterings of the same set of data; the measure essentially considers how each pair of data points is assigned in each clustering.
Conference Paper
The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
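The D² seeding rule at the heart of k-means++ is short enough to write out directly: each new center is sampled with probability proportional to the squared distance from the nearest center already chosen. The NumPy sketch below is illustrative only; in practice scikit-learn's KMeans(init="k-means++") is the usual route.

```python
import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]            # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance of every point to its nearest already-chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

# Three synthetic blobs; the seeds should land roughly one per blob.
X = np.vstack([np.random.default_rng(i).normal(i * 5, 0.5, (30, 2)) for i in range(3)])
print(kmeanspp_seeds(X, k=3))
```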
Article
It has long been realized that in pulse-code modulation (PCM), with a given ensemble of signals to handle, the quantum values should be spaced more closely in the voltage regions where the signal amplitude is more likely to fall. It has been shown by Panter and Dite that, in the limit as the number of quanta becomes infinite, the asymptotic fractional density of quanta per unit voltage should vary as the one-third power of the probability density per unit voltage of signal amplitudes. In this paper the corresponding result for any finite number of quanta is derived; that is, necessary conditions are found that the quanta and associated quantization intervals of an optimum finite quantization scheme must satisfy. The optimization criterion used is that the average quantization noise power be a minimum. It is shown that the result obtained here goes over into the Panter and Dite result as the number of quanta becomes large. The optimum quantization schemes for 2^b quanta, b = 1, 2, …, 7, are given numerically for Gaussian and for Laplacian distribution of signal amplitudes.
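The necessary conditions derived in the paper lead to an alternating procedure now usually called the Lloyd-Max quantizer: set each quantum to the conditional mean of its interval, then set each interval boundary to the midpoint of adjacent quanta, and repeat. The NumPy sketch below applies it to Gaussian samples with four quanta (b = 2); the sample size and initial levels are assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 100_000)        # Gaussian stand-in for the signal ensemble
levels = np.linspace(-1.5, 1.5, 4)             # initial quanta for 2-bit quantization

for _ in range(50):
    boundaries = (levels[:-1] + levels[1:]) / 2          # midpoints between adjacent quanta
    bins = np.digitize(samples, boundaries)              # assign each sample to a quantum
    levels = np.array([samples[bins == i].mean() for i in range(len(levels))])

boundaries = (levels[:-1] + levels[1:]) / 2
recon = levels[np.digitize(samples, boundaries)]
print("quanta:", levels.round(3))
print("mean squared quantization error:", round(np.mean((samples - recon) ** 2), 4))
```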
Article
A procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical. Given n sets, this procedure permits their reduction to n − 1 mutually exclusive sets by considering the union of all possible n(n − 1)/2 pairs and selecting a union having a maximal value for the functional relation, or objective function, that reflects the criterion chosen by the investigator. By repeating this process until only one group remains, the complete hierarchical structure and a quantitative estimate of the loss associated with each stage in the grouping can be obtained. A general flowchart helpful in computer programming and a numerical example are included.
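SciPy exposes this grouping criterion as Ward linkage. The short example below, on toy 2-D data, builds the full merge hierarchy and then cuts it into two groups; it illustrates the procedure rather than reproducing the paper's numerical example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

Z = linkage(X, method="ward")                     # full merge history (the dendrogram)
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two groups
print(labels)
```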
Article
The problem of comparing two different partitions of a finite set of objects reappears continually in the clustering literature. We begin by reviewing a well-known measure of partition correspondence often attributed to Rand (1971), discuss the issue of correcting this index for chance, and note that a recent normalization strategy developed by Morey and Agresti (1984) and adopted by others (e.g., Milligan and Cooper 1985) is based on an incorrect assumption. Then, the general problem of comparing partitions is approached indirectly by assessing the congruence of two proximity matrices using a simple cross-product measure. They are generated from corresponding partitions using various scoring rules. Special cases derivable include traditionally familiar statistics and/or ones tailored to weight certain object pairs differentially. Finally, we propose a measure based on the comparison of object triples having the advantage of a probabilistic interpretation in addition to being corrected for chance (i.e., assuming a constant value under a reasonable null hypothesis) and bounded between ±1.
Article
The clustering problem is cast in the framework of possibility theory. The approach differs from the existing clustering methods in that the resulting partition of the data can be interpreted as a possibilistic partition, and the membership values can be interpreted as degrees of possibility of the points belonging to the classes, i.e., the compatibilities of the points with the class prototypes. An appropriate objective function whose minimum will characterize a good possibilistic partition of the data is constructed, and the membership and prototype update equations are derived from necessary conditions for minimization of the criterion function. The advantages of the resulting family of possibilistic algorithms are illustrated by several examples.
Article
Unstructured text documents are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors--a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high-dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by ...
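A compact NumPy rendering of the spherical k-means step described here, on tf-idf vectors from a toy corpus (not the paper's code): documents and concept vectors are kept on the unit sphere, cosine similarity drives the assignment, and each concept vector is the renormalized centroid of its cluster.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr and sleep", "dogs bark loudly", "kittens purr", "puppies bark"]
X = TfidfVectorizer().fit_transform(docs).toarray()   # rows are already L2-normalized

k, rng = 2, np.random.default_rng(0)
concepts = X[rng.choice(len(X), size=k, replace=False)]  # initial concept vectors
for _ in range(20):
    assign = np.argmax(X @ concepts.T, axis=1)           # nearest concept by cosine similarity
    for j in range(k):
        members = X[assign == j]
        if len(members):
            c = members.sum(axis=0)
            concepts[j] = c / np.linalg.norm(c)           # renormalize to unit Euclidean norm
print(assign)
```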