Article

A Dirichlet multinomial mixture model-based approach for short text clustering

Authors: Jianhua Yin and Jianyong Wang

Abstract

Short text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem due to its sparse, high-dimensional, and large-volume characteristics. In this paper, we propose a collapsed Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbr. to GSDMM). We find that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and that it converges quickly. GSDMM can also cope with the sparsity and high dimensionality of short texts, and can obtain the representative words of each cluster. Our extensive experimental study shows that GSDMM achieves significantly better performance than three other clustering models.
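As an illustration of the collapsed Gibbs sampling procedure the abstract describes, the following is a minimal sketch in Python. It assumes symmetric priors alpha and beta and tokenized documents; all variable names are illustrative and not taken from the authors' released code, and a practical implementation would work in log space to avoid underflow on longer documents.

```python
import random
from collections import defaultdict

def gsdmm(docs, V, K=30, alpha=0.1, beta=0.1, n_iters=30):
    """Sketch of collapsed Gibbs sampling for the Dirichlet Multinomial
    Mixture model. docs: list of token lists; V: vocabulary size;
    K: upper bound on the number of clusters (many end up empty)."""
    D = len(docs)
    z = [0] * D                                 # cluster label per document
    m = [0] * K                                 # documents per cluster
    n = [0] * K                                 # words per cluster
    nw = [defaultdict(int) for _ in range(K)]   # word counts per cluster

    def move(d, k, sign):                       # add (+1) / remove (-1) doc d
        m[k] += sign
        n[k] += sign * len(docs[d])
        for w in docs[d]:
            nw[k][w] += sign

    for d in range(D):                          # random initialization
        z[d] = random.randrange(K)
        move(d, z[d], +1)

    for _ in range(n_iters):
        for d in range(D):
            move(d, z[d], -1)                   # take doc d out of its cluster
            probs = []
            for k in range(K):
                p = (m[k] + alpha) / (D - 1 + K * alpha)
                seen = defaultdict(int)
                for i, w in enumerate(docs[d]):
                    p *= (nw[k][w] + beta + seen[w]) / (n[k] + V * beta + i)
                    seen[w] += 1
                probs.append(p)
            z[d] = random.choices(range(K), weights=probs)[0]
            move(d, z[d], +1)
    return z
```

Because the document-cluster prior favors large clusters while the word term favors clusters with similar content, most of the K initial clusters empty out as sampling proceeds, which is how GSDMM infers the number of clusters automatically.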


... (1) latent Dirichlet allocation (LDA), the most widely known and used approach to topic modeling (Anoop & Sreelakshmi, 2023; Blei et al., 2003); (2) Gibbs sampling Dirichlet multinomial mixture (GSDMM), a topic modeling algorithm designed for short social media texts (Yin & Wang, 2014); and (3) BERTopic (Grootendorst, 2022), a deep neural network that leverages the Bidirectional Encoder Representations from Transformers (BERT) language model (Devlin et al., 2019). As with other machine learning approaches, where results from multiple algorithms are compared (Cascalheira et al., 2023a, b; Ovalle et al., 2021; Saha et al., 2019), we used three topic modeling algorithms to increase the interpretability of the modeling results. ...
... First, social media posts are short texts (i.e., usually a few sentences long), meaning they rarely have multiple topics within them (Blei et al., 2003; Vayansky & Kumar, 2020), which, in turn, reduces the effectiveness of traditional topic modeling approaches such as LDA (Vayansky & Kumar, 2020). To account for this limitation, we used GSDMM, as it computes only one topic per social media post (Yin & Wang, 2014). Second, because social media posts are short in length, they often contain a lot of noise (i.e., topic modeling algorithms present outliers as representative topics; Li et al., 2018), which presents a challenge in isolating useful signals (i.e., well-defined, robust topics). ...
... In GSDMM, each social media post is randomly assigned to one of K clusters and, during each iteration of the algorithm, each document is reassigned to a cluster according to a conditional distribution (Yin & Wang, 2014). The GSDMM algorithm was executed with the gsdmm script implemented by Walker (2021). ...
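For reference, a typical invocation of that script looks like the sketch below; the MovieGroupProcess interface is assumed from the README of the openly available gsdmm package and should be checked against the installed version.

```python
# Hypothetical usage of the open-source gsdmm package; the
# MovieGroupProcess class and fit() signature are assumptions
# based on its public documentation.
from gsdmm import MovieGroupProcess

docs = [["mpox", "vaccine", "appointment", "today"],
        ["new", "mpox", "cases", "reported"]]
vocab_size = len({w for doc in docs for w in doc})

mgp = MovieGroupProcess(K=30, alpha=0.1, beta=0.1, n_iters=30)
labels = mgp.fit(docs, vocab_size)   # one cluster label per post
```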
Article
Full-text available
Introduction During the summer of 2022, monkeypox (mpox) became a public health emergency primarily affecting members of sexual and gender minority (SGM) communities. Understanding how SGM social media users perceived mpox could inform community-based public health prevention strategies for future disease outbreaks. However, social media platform selection could substantially affect the insights drawn about mpox. Methods The present study used topic modeling and tested differences in psycholinguistic attributes using 8,605 mpox-related Reddit posts and 1,819 mpox-related tweets collected from 01 May 2022 to 13 October 2022. Results Results showed substantial differences in mpox communication themes between the two groups. The SGM-specific Reddit sample primarily discussed concerns with how mpox predominately affects gay, bisexual, and other men who have sex with men, desires to find and share information about mpox vaccination, and motives to understand how mpox spreads among the community. The non-SGM-specific Twitter sample primarily discussed news reports about mpox cases provided by U.S. agencies, mpox from an international perspective, and concerns about mpox in the context of COVID-19. Group differences in psycholinguistic attributes were also evident. Conclusions Our findings underscore the importance of community-centered public health surveillance because mpox communication differed significantly between SGM-specific Reddit communities vs. non-SGM-specific Twitter users. Public health surveillance using social media must account for these communication differences, as conclusions about a population’s reaction to disease can differ substantially based on the platform surveyed. Policy Implications Public health officials can use these findings to design community- and platform-tailored prevention messaging for future outbreaks of mpox and other infectious diseases.
... To effectively handle sparse short texts, existing topic models mainly focus on word co-occurrence enrichment, so as to estimate a more discriminative topic representation for each short text. For example, the biterm topic model (BTM) [11] directly leverages global word co-occurrences at the corpus level, and the Dirichlet multinomial mixture (DMM) [57] supposes that each short text covers only a single topic, indirectly enriching word co-occurrences at the document level. Recently, they have been upgraded by incorporating pre-trained word embeddings [28,29] and variational manifold regularization [31]. ...
... The code is available on the net. • GSDMM [57]: This is the DMM topic model inferred by Gibbs sampling. In GSDMM, each document is supposed to cover only a single topic, so as to be applicable to short texts. ...
... The first observation is that our A²snmf significantly outperforms the traditional k-means and Ncut methods across all datasets. Unsurprisingly, the k-means method empirically fails at clustering short texts, in line with previous reports [54,57]; e.g., it only gets about 0.25 and 0.36 ACC scores on Snippets and StackOverFlow, respectively. The results give further empirical evidence that prototype-based clustering methods may lose effectiveness on the BoW features of short texts, due to the sparsity problem. ...
Article
Full-text available
Short text clustering is a significant yet challenging task, where short texts generated from the Internet are extremely sparse, noisy, and ambiguous. The sparse nature makes traditional clustering methods, e.g., the k-means family and topic modeling, much less effective. Fortunately, recent advances in document distance, e.g., word mover's distance, and document representation, e.g., BERT, can accurately measure the similarities of short texts, especially their nearest neighbors. Inspired by these advances and observations, we induce short text clusters by directly factorizing the informative affinity matrix of nearest neighbors into the product of the cluster assignment matrix, following the intuition that neighboring short texts tend to be assigned to the same cluster. However, due to the noisy nature of short texts, many of them can be regarded as outliers or near-outliers, resulting in many noisy neighboring similarities within the affinity matrix. To further alleviate this problem, we enhance the affinity matrix factorization by (1) incorporating a sparse noise matrix to directly capture noisy neighboring similarities and (2) regularizing the cluster assignment matrix by the ℓ2,1 norm to eliminate hard-to-cluster short texts (called pseudo-outliers), so as to indirectly neglect the noisy neighboring similarities corresponding to them. After this factorization for pre-clustering, we train a classifier over the resulting clusters and adopt it to assign each pseudo-outlier to one cluster. We call this novel clustering method anomaly-aware symmetric non-negative matrix factorization (A²snmf). Experimental results on benchmark short text datasets demonstrate that A²snmf performs very competitively with the existing baseline methods. The code is available at the website https://github.com/wizardbo/A3SNMF_functions.
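To make the factorization idea concrete, here is a deliberately simplified symmetric NMF sketch, without the sparse noise matrix and the ℓ2,1 regularizer that distinguish A²snmf; the multiplicative update shown is the standard heuristic for minimizing ||A − HHᵀ||².

```python
import numpy as np

def symnmf(A, k, n_iters=200, eps=1e-9, seed=0):
    """Plain symmetric NMF: approximate A ≈ H @ H.T with H >= 0.
    A: (n, n) nonnegative affinity matrix of nearest-neighbor similarities.
    The largest entry in each row of H gives a cluster assignment."""
    n = A.shape[0]
    H = np.random.default_rng(seed).random((n, k))
    for _ in range(n_iters):
        # multiplicative update keeps H nonnegative
        H *= (A @ H) / (H @ (H.T @ H) + eps)
    return H

# labels = symnmf(A, k=10).argmax(axis=1)
```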
... The authors in [9] proposed a DMM model with feature partition. Moreover, a Gibbs sampling method based on DMM (GSDMM) was proposed for short text clustering [14]. GSDMM needs the maximum number of possible clusters (k) to find the optimal number of clusters [5]. The main drawback of this method is its high computational cost (in space and time) when a high value is assigned to the parameter k. ...
... Addressing this issue, Nigam et al. [12] introduced the Dirichlet Multinomial Mixture (DMM) model to mitigate the complexities associated with high-dimensional representations in text data. Building upon this approach, a more refined variation known as the Gibbs Sampling DMM (GSDMM) was later introduced in [14]; this improvement in model design offers better convergence and the capability to automatically infer the optimal number of topics. ...
... This prevalence of limited co-occurrences poses a challenge in determining the ideal number of clusters, potentially leading to suboptimal outcomes in the categorization of topics [43,44]. Various strategies, such as text augmentation [45], topic modeling [46], and the Dirichlet Mixture Model [14,47,48], have been proposed to address the challenge of sparsity in short-text clustering. Moreover, in the realm of short-text clustering models, studies have delved into innovative approaches. ...
Article
Full-text available
Topic modeling methods have proved to be effective for inferring latent topics from short texts. Dealing with short texts is challenging yet helpful for many real-world applications, due to the sparse terms in the text and the high-dimensionality representation. Most topic modeling methods require the number of topics to be defined beforehand. Similarly, methods based on the Dirichlet Multinomial Mixture (DMM) involve specifying the maximum possible number of topics before execution, which is hard to determine due to topic uncertainty and the noise present in the dataset. Hence, a new approach called the Topic Clustering algorithm based on Levenshtein Distance (TCLD) is introduced in this paper. TCLD combines DMM models and a fuzzy matching algorithm to address two key challenges in topic modeling: (a) the outlier problem in topic modeling methods and (b) the problem of determining the optimal number of topics. TCLD uses the initial clustered topics generated by DMM models and then evaluates the semantic relationships between documents using Levenshtein Distance. Subsequently, it determines whether to keep a document in the same cluster, relocate it to another cluster, or mark it as an outlier. The results demonstrate the efficiency of the proposed approach across six English benchmark datasets, in comparison to seven topic modeling approaches, with an 83% improvement in purity and a 67% enhancement in Normalized Mutual Information (NMI) across all datasets. The proposed method was also applied to a collection of Arabic tweets, and the results showed that only 12% of the Arabic short texts were incorrectly clustered, according to human inspection.
... Existing efforts detecting topics from short texts can be generally classified into two groups: internal models that focus on extracting more information from the given data through different mechanisms [4,7,22,24,29,32] and external approaches that leverage semantic information learned from information sources other than the given data (e.g., pre-trained word embeddings) [14,18,21,27,30]. ...
... For example, the Biterm Topic Model (BTM) [4] discovers topics by directly modeling on biterms and each biterm consists of a pair of words appearing together in a short context. The Dirichlet Multinomial Mixture model (DMM) [32] assigns only one topic to each short-text document by assuming all words in the document are generated from the same topic distribution. The Self-Aggregation based Topic Model (SATM) [22] extracts topics from long pseudo documents formed by semantically similar texts. ...
Article
Full-text available
Given the prevalence of short texts as a popular form of information on the Internet, inferring latent topics from short texts has attracted increasing interest from both academia and industry. To address the data sparsity of short texts in terms of word co-occurrences, existing research efforts either try to extract more information from the given data internally or leverage externally learned semantic information such as pre-trained word embeddings. In this paper, we propose a novel model, called Dual-Reinforced Topic Model (DRTM), to identify topics from short texts by harnessing both internal and external semantic information. Improving existing internal methods that consider only first-order co-occurrence relations between words, our model exploits multi-order relations so that the relevance between words not explicitly appearing together in the given data can be captured. Addressing the limitation of existing external methods that utilize only distributed representations at the word level, we further incorporate document representations into our model to facilitate topic modeling. We have evaluated our model on multiple publicly available datasets. Our experimental results have demonstrated that DRTM clearly outperforms existing internal and external methods in terms of both topic coherence and document classification accuracy.
... Another very popular technique that researchers have used for short text topic modeling is the Gibbs Sampling Dirichlet Mixture Model (GSDMM) [12]. This approach iterates over the documents and reassigns each one to a cluster according to a conditional probability. ...
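The conditional probability in question is, as we recall it from Equation (4) of Yin and Wang (2014), the probability of reassigning document d to cluster z given all other assignments (consult the original paper for the derivation):

```latex
p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \;\propto\;
\frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha}
\cdot
\frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left(n_{z,\neg d}^{w} + \beta + j - 1\right)}
     {\prod_{i=1}^{N_d} \left(n_{z,\neg d} + V\beta + i - 1\right)}
```

where m_z is the number of documents in cluster z, n_z and n_z^w are the total word count of cluster z and its count of word w, N_d and N_d^w are the length of document d and its count of word w, V is the vocabulary size, and ¬d denotes counts computed with document d removed.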
... suggested a Bayesian inference of a class-specified topic model, with the supervised scenario being a specific instance. This model's basic assumption is a single topic sample for each text [12,14]. The model claims to address the sparsity problem of short text clustering while also displaying word topics, but it does not capture the semantics of the words in the process of model generation. ...
Article
Full-text available
Twitter has emerged as a significant source of data to be used for text summarization, topic modeling, document clustering, information retrieval, sentiment analysis, etc. Using hashtags, Twitter users may categorize their tweets, as hashtags provide the essential meta-information connecting tweets to the underlying themes. However, the majority of tweets do not have hashtags, which makes it challenging to search for a particular theme. The proposed model is designed to recommend appropriate hashtags for tweets by considering their categorization into sports, politics, health, or technology. In this paper, we propose a novel heuristic for recommending hashtags for tweets. We took 20,000 tweets, comprising 5,000 tweets from each of the specified four topics along with their respective hashtags. These hashtags were manually assigned by a group of experts and were subsequently excluded during the topic modeling process. A basic data-cleaning technique is applied to clean and tokenize the tweets. The Word2Vec technique is then used to vectorize the tokens, which captures the semantic meaning of the words in the tweets and overcomes data sparsity issues. The dimension of the data is reduced using Singular Value Decomposition (SVD) followed by t-SNE (t-distributed Stochastic Neighbor Embedding). The reduced data is divided into four clusters, and a semi-supervised method is introduced to link these clusters to the aforementioned topics, which eventually helps to produce hashtags for a list of tweets. Comparing our results with existing techniques, we observe that our model performs better with respect to the metrics precision, recall, and F1-score.
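A hedged sketch of the pipeline this abstract describes (Word2Vec features, SVD, t-SNE, then four clusters); all parameter values are illustrative guesses, and the corpus must be large enough for the chosen SVD rank and t-SNE perplexity.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def embed_and_cluster(tokenized_tweets, n_clusters=4):
    """tokenized_tweets: list of token lists (thousands of tweets)."""
    w2v = Word2Vec(tokenized_tweets, vector_size=100, window=5, min_count=1)
    # average word vectors to get one vector per tweet
    X = np.array([np.mean([w2v.wv[t] for t in doc], axis=0)
                  for doc in tokenized_tweets])
    X = TruncatedSVD(n_components=50).fit_transform(X)        # coarse reduction
    X = TSNE(n_components=2, perplexity=30).fit_transform(X)  # 2-D embedding
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
```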
... We employ the Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM) (Yin & Wang, 2014) topic modeling framework designed explicitly for extracting topics from short text documents. Notably, GSDMM's assumption of only one topic per document aligns well with the concise nature of the tweets, in contrast to Latent Dirichlet Allocation (LDA) (Blei et al., 2001), which typically accommodates multiple topics in documents. ...
... We cluster the tweets by topics and label them using the GSDMM topic modeling framework discussed in Section 3. The allocation of tweets by topic label for five of the most prominent topics is shown in Table 1. We omit the interpretation of the resulting clusters here and refer the reader to Yin and Wang (2014) for more details. The resulting pre-processed dataset contains 107,017 users and 2.7 million tweet samples. ...
Article
Full-text available
In the contemporary digital era, individuals are frequently inundated with voluminous content on their home timelines, leading to information overload. As a result, even relevant content might get sidelined and overlooked by users, resulting in a long-term negative impact on engagement. While existing state-of-the-art recommender systems address this problem successfully to an extent using the historical engagement data and users’ social graphs, they don’t explicitly account for users’ choice behavior on engagement in these platforms. In this work, we address the problem of maximizing user engagement with tweet content by explicitly considering the choice behavior of users. We formulate the engagement forecasting task as a multi-label classification problem that captures choice behavior based on an unsupervised clustering of tweet-topics. We propose a neural network model that incorporates engagement history and predicts user choices conditional on this context. We study the impact of recommending tweets on engagement outcomes by solving an appropriately defined tweet optimization problem based on our proposed model using a large publicly available dataset. Our extensive experiments reveal that our model outperforms the state-of-the-art recommendation methods and discrete choice models from the existing literature across multiple performance metrics.
... Another very popular technique that researchers have used for short text topic modeling is the Gibbs Sampling Dirichlet Mixture Model (GSDMM) [20,21]. This approach iterates over the documents and reassigns each one to a cluster according to a conditional probability. ...
... suggested a Bayesian inference of a class-specified topic model, with the supervised scenario being a specific instance. This model's basic assumption is a single topic sample for each text [20,23]. The model asserts its ability to tackle the sparsity issue encountered in short text clustering while also presenting word topics. ...
Article
Full-text available
Social media stands as a crucial information source across various real-world challenges. Platforms like Twitter, extensively used by news outlets for real-time updates, categorize news via hashtags. These hashtags act as pivotal meta-information for linking tweets to underlying themes, yet many tweets lack them, posing challenges in topic searches. Our contribution addresses this by introducing a novel heuristic for hashtag recommendation. Extracting 20 thousand tweets, 5,000 each from the distinct categories of health, sports, politics, and technology, we applied fundamental data cleaning and tokenization techniques. Leveraging Word2Vec, we vectorized tokens, capturing nuanced semantic meanings and mitigating data sparsity issues. The proposed heuristic creates clusters of different topics by combining these embedded features with the idea of the fuzzy C-means technique. We then develop a rule-based approach that combines both supervised and unsupervised methods to label clusters, indicating their respective topics. The experimental outcomes show that our proposed techniques achieve better performance in precision, recall, and F1-score compared to specific baseline models.
... Due to the sparsity of short text, a short text is likely to contain only one topic. Based on this idea, Yin [14] proposed GSDMM (a collapsed Gibbs sampling algorithm for the Dirichlet multinomial mixture model). GSDMM is often superior to the LDA model in short text and sparse text analysis [15]. ...
... The topic keywords in a question sentence play a significant role in understanding the question. We design a topic keyword filtering algorithm (GSDMM-Filter) based on GSDMM; unlike the original GSDMM [14] algorithm, GSDMM-Filter can extract the topic keywords. GSDMM automatically counts the keyword distribution of each topic and the number of times each keyword appears in a topic while processing and clustering the interrogative information. ...
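While GSDMM-Filter itself is not reproduced here, the underlying idea of reading representative keywords off GSDMM's per-cluster counts can be sketched as follows; the frequency ranking is a simplified stand-in for the paper's filtering step.

```python
from collections import Counter, defaultdict

def top_keywords(docs, labels, k=10):
    """Rank the words inside each cluster by within-cluster frequency.
    docs: list of token lists; labels: GSDMM cluster label per document."""
    counts = defaultdict(Counter)
    for doc, z in zip(docs, labels):
        counts[z].update(doc)
    return {z: [w for w, _ in c.most_common(k)] for z, c in counts.items()}
```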
... Recent research has turned to the Dirichlet process multinomial mixture (DPMM) model, which has demonstrated significant progress in short text clustering. Models such as DCT [15], GSDMM [34], and GSDPMM [35] have been proposed based on this approach, and DPMM-based methods have become the most common baseline models for short text stream clustering due to their ability to automatically infer clusters and effectively handle topic drift. ...
... When only unlabeled documents are available, the algorithm can be regarded as a clustering model. Yin and Wang [34] introduced a collapsed Gibbs sampling algorithm for the DMM model to solve the short text clustering problem. Yu et al. [38] proposed a block Gibbs sampling algorithm for approximating DPMM models, but this method converges slowly. ...
Article
Full-text available
Short text streams, such as social media comments, are continuously generated, making effective clustering methods essential for extracting valuable information. However, existing research fails to address the problem of topic concentration in clustering, which leads to multiple topics being confused in one cluster, making it challenging to summarize the cluster centers. To tackle this issue, this paper proposes a novel topic-enhanced clustering method called TEDM, based on the Dirichlet model. The method uses dynamic clustering, leveraging topic information to improve the sampling of documents and better cluster documents on the same topic. TEDM constructs a dynamic word relation graph to extract topic terms, which is updated with the stream of documents to cope with dynamic changes in topics. Extensive experimental studies demonstrate that TEDM outperforms state-of-the-art works on multiple real datasets.
... As baselines for short and conventional-length topic modeling, we include GSDMM [16] and LDA, respectively. LDA is the most frequently used model when topic modeling for conversation is considered. ...
... FTM first applies dimensionality reduction through PCA and afterward uses fuzzy c-means clustering to assign words to topics. Yin et al. propose a short text clustering method called Gibbs Sampling for the Dirichlet Multinomial Mixture model (GSDMM) [16]. This generative method assumes a document corresponds to a single topic. ...
... See Yin and Wang [62] for the definition of α and β. ...
Preprint
Full-text available
In the past few years, "metaverse" and "non-fungible tokens (NFT)" have become buzzwords, and the prices of related assets have shown speculative bubble-like behavior. In this paper, we attempt to better understand the underlying economic dynamics. To do so, we look at Decentraland, a virtual world platform where land parcels are sold as NFT collections. We find that initially, land prices followed traditional real estate pricing models -- in particular, value decreased with distance from the most desirable areas -- suggesting Decentraland behaved much like a virtual city. However, these real estate pricing models stopped applying when both the metaverse and NFTs gained increased popular attention and enthusiasm in 2021, suggesting a new driving force for the underlying asset prices. At that time, following a substantial rise in NFT market values, short-term holders of multiple parcels began to take major selling positions in the Decentraland market, which hints that, rather than building a metaverse community, early Decentraland investors preferred to cash out when land valuations became overly inflated. Our analysis also shows that while the majority of buyers are new entrants to the market (many of whom joined during the bubble), liquidity (i.e., parcels) was mostly provided by early adopters selling, which caused stark differences in monetary gains. Early adopters made money -- more than 10,000 USD on average per parcel sold -- but users who joined later typically made no profit or even incurred losses in the order of 1,000 USD per parcel. Unlike established markets such as financial and real estate markets, newly emergent digital marketplaces are mostly self-regulated. As a result, the significant financial risks we identify indicate a strong need for establishing appropriate standards of business conduct and improving user awareness.
... Traditional topic modeling algorithms like PLSA and LDA are widely used to uncover latent semantic structures in text corpora by relying on word co-occurrence patterns at the document level. However, these methods require a high frequency of word co-occurrences to generate meaningful topics, leading to significant performance degradation when applied to short texts where such information is sparse (Yin and Wang, 2014; Yan et al., 2013). Similarly, the performance of LSI declines over short texts as the detected topics become ambiguous, resulting in negative values in its decomposed matrices that are difficult to interpret (Murshed et al., 2023; Alghamdi and Alfalqi, 2015). ...
Preprint
Full-text available
As short text data in native languages like Hindi increasingly appear in modern media, robust methods for topic modeling on such data have gained importance. This study investigates the performance of BERTopic in modeling Hindi short texts, an area that has been under-explored in existing research. Using contextual embeddings, BERTopic can capture semantic relationships in data, making it potentially more effective than traditional models, especially for short and diverse texts. We evaluate BERTopic using 6 different document embedding models and compare its performance against 8 established topic modeling techniques, such as Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), Latent Semantic Indexing (LSI), Additive Regularization of Topic Models (ARTM), Probabilistic Latent Semantic Analysis (PLSA), Embedded Topic Model (ETM), Combined Topic Model (CTM), and Top2Vec. The models are assessed using coherence scores across a range of topic counts. Our results reveal that BERTopic consistently outperforms other models in capturing coherent topics from short Hindi texts.
... Due to the difficulty of using LDA to extract correlations between topics, Blei et al. [17] proposed the Correlated Topic Model (CTM) in 2006, an improved version of LDA that utilizes log-normal distributions and covariance matrices. Moreover, Nigam et al. [18,19] proposed the Dirichlet Multinomial Mixture (DMM) model based on the naive Bayes assumption for short text data processing. To address the challenge of feature sparsity in short texts, Yan et al. [20] devised the innovative Biterm Topic Model (BTM), which can mine all biterms in the text collection and directly perform topic inference on the biterm set. ...
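As a concrete illustration of the biterms BTM operates on, each short text contributes every unordered pair of distinct words it contains; a minimal extraction sketch:

```python
from itertools import combinations

def biterms(tokens):
    """All unordered pairs of distinct words co-occurring in one short text."""
    return list(combinations(sorted(set(tokens)), 2))

print(biterms(["apple", "iphone", "release"]))
# [('apple', 'iphone'), ('apple', 'release'), ('iphone', 'release')]
```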
Article
Full-text available
Geological disasters, as a common occurrence, have a serious impact on social development in terms of their frequency of occurrence, disaster effects, and resulting losses. To effectively reduce the casualties, property losses, and social effects caused by various disasters, it is necessary to conduct real-time monitoring and early warning of various geological disaster risks. With the growing development of the information age, public attention to disaster relief, casualties, social impact effects, and other related situations has been increasing. Since social media platforms such as Weibo and Twitter contain a vast amount of real-time data related to disaster information before and after a disaster occurs, scientifically and effectively utilizing these data can provide sufficient and reliable information support for disaster relief, post-disaster recovery, and public appeasement efforts. As one of the techniques in natural language processing, the topic model can achieve precise mining and intelligent analysis of valuable information from massive amounts of data on social media, enabling the rapid use of topic models for disaster analysis after a disaster occurs and providing a reference for post-disaster rescue work. Therefore, this article first provides an overview of the development process of the topic model. Secondly, based on the technology utilized, topic models are roughly classified into three categories: traditional topic models, word embedding-based topic models, and neural network-based topic models. Finally, taking the disaster data of the “Dongting Lake breach” in Hunan, China as the research object, the application process and effectiveness of the topic model in urban geological disaster information mining are systematically introduced. The research results provide important references for further practical innovation and expansion of the topic model in the field of disaster information mining.
... Clustering, also known as unsupervised learning, is a process of discovery and exploration for investigating inherent and hidden structures within a large dataset [10]. It has been extensively applied to a variety of tasks [17,32,11,45,21,18,47,30,46,41,20]. Many clustering algorithms have been proposed in different scientific disciplines [13], and these methods often differ in the selection of objective functions, probabilistic models or heuristics adopted. ...
Preprint
Full-text available
Discovering clusters from a dataset with different shapes, densities, and scales is a known challenging problem in data clustering. In this paper, we propose the RElative COre MErge (RECOME) clustering algorithm. The core of RECOME is a novel density measure, i.e., Relative K nearest Neighbor Kernel Density (RNKD). RECOME identifies core objects with unit RNKD, and partitions non-core objects into atom clusters by successively following higher-density neighbor relations toward core objects. Core objects and their corresponding atom clusters are then merged through α-reachable paths on a KNN graph. We discover that the number of clusters computed by RECOME is a step function of the α parameter with jump discontinuity on a small collection of values. A fast jump discontinuity discovery (FJDD) method is proposed based on graph theory. RECOME is evaluated on both synthetic datasets and real datasets. Experimental results indicate that RECOME is able to discover clusters with different shapes, densities, and scales. It outperforms six baseline methods on both synthetic datasets and real datasets. Moreover, FJDD is shown to be effective in extracting the jump discontinuity set of the parameter α for all tested datasets, which can ease the task of data exploration and parameter tuning.
... In the extreme case of very short texts, some authors showed the efficiency of models such as Biterm Topic Models (BTMs) (Cheng et al. 2014) or methods incorporating word embeddings in topic modelling (Qiang et al. 2017), in order to solve the issue of the sparsity of co-occurrences. Another interesting example is Dirichlet Multinomial Mixture (DMM) models (Yin et al. 2014), which assume that each text is sampled by only one topic. We must stress, however, that our choice to focus on only one part of speech (nouns)-for comparison purposes with the KBS method-changes the perspective concerning conventional short texts' analysis. ...
Article
Full-text available
This study compares and contrasts the results of two lexical-based methods aimed at identifying content temporal trends in diachronic text corpora. A corpus of end-of-year addresses of the presidents of the Italian Republic constitutes a relevant case of political speech useful to understand how the temporal evolution of topics can be represented and whether a downward (ex post) or an upward (ex ante) extraction of topics is more effective for the identification of presidents’ distinctive traits and trends. The first method is a knowledge-based system (KBS), which identifies clusters of words sharing a similar temporal pattern through a three-step statistical learning procedure. The second is a structural topic model (STM), which identifies main topics by probing the possible effect of the year and president factors on the speech-topic and the topic-word distributions. In KBS clusters, the individual trait of the president stands out as one of the most relevant elements and determines the contents of speeches; moreover, topic trends can also be discerned ex post while interpreting the results. On the other hand, STM directly achieves the whole topic structure but seems not as powerful as expected in portraying the life cycle of words and detecting groups of words that distinguish the speeches of a specific president. As most presidential speeches are rich and cover a wide range of topics, the results suggest that, in this case, the interpretative tool offered by STM brings out more challenges than strengths. Conversely, direct observation of the temporal trajectory of individual words allows for more detailed analyses and meaningful results, thanks to the flexible and adaptive KBS approach.
... tBERT [30] combines topic modeling with BERT for semantic similarity prediction, using LDA and GSDMM [31] as topic models. tBERT has demonstrated a performance improvement compared with BERT alone, highlighting the benefit of integrating topic information. ...
Article
Full-text available
As an extension of the transformer architecture, the BERT model has introduced a new paradigm for natural language processing, achieving impressive results in various downstream tasks. However, high-performance BERT-based models—such as ELECTRA, ALBERT, and RoBERTa—suffer from limitations such as poor continuous learning capability and insufficient understanding of domain-specific documents. To address these issues, we propose the use of an attention mechanism to combine BERT-based models with neural topic models. Unlike traditional stochastic topic modeling, neural topic modeling employs artificial neural networks to learn topic representations. Furthermore, neural topic models can be integrated with other neural models and trained to identify latent variables in documents, thereby enabling BERT-based models to sufficiently comprehend the contexts of specific fields. We conducted experiments on three datasets—Movie Review Dataset (MRD), 20Newsgroups, and YELP—to evaluate our model’s performance. Compared to the vanilla model, the proposed model achieved an accuracy improvement of 1–2% for the ALBERT model in multiclassification tasks across all three datasets, while the ELECTRA model showed an accuracy improvement of less than 1%.
... As we include some unlabeled instances from the NoCom category in data generation, the rule-based data contains more data samples, with 224 Promo, 164 Phasing, 216 POS, and 130 Other instances. We experimented with different topic modelling techniques [26,51], including LDA, the Dirichlet Multinomial Mixture model with Gibbs sampling (GSDMM), and a Dirichlet Multinomial Mixture-based model using a generalized Pólya urn scheme (GPU-DMM), for automated label generation. These methods generate the labels based on the textual commentaries (e.g., see the Commentary column in Table 2). ...
Article
Full-text available
Companies generate operational reports to measure business performance and evaluate discrepancies between actual outcomes and forecasts. Analysts comment on these reports to explain the causes of deviations. In this paper, we propose a machine learning-based framework to predict the commentaries from the operational data generated by a company. We use time series classification to predict labels for the existing commentaries, and compare various machine learning models for the prediction task, including XGBoost, long short-term memory networks, and fully convolutional networks (FCN). Classification models are trained on three datasets and their performance is evaluated in terms of accuracy and F1-score. We consider AI interpretability as an additional component in our framework to better explain the predictions to the decision makers. Our numerical study shows that the FCN architecture provides higher classification performance, and the Class Activation Maps and SHAP interpretability methods provide intuitive explanations for the model predictions. We find that the proposed framework, enabled by machine learning-based methods, offers new avenues to leverage management information systems for providing insights to managers on key financial issues, including sales forecasting and inventory management.
... Given this, a short-text topic modelling approach was employed, using the Gibbs Sampling Dirichlet Mixture Model (GSDMM) algorithm. Unlike LDA, which assumes multiple topics per document and gives a probability distribution for each document (Blei et al. 2003), the GSDMM algorithm assumes a single topic per document (Yin and Wang 2014). The mathematical background for GSDMM is given in the Appendix. ...
Article
Full-text available
Food systems actors are key enablers or barriers to transformation toward social and ecological sustainability. We mapped 1422 UK food system actors across different sub-sectors, scales, organisational levels, and specialisms. We then surveyed the priorities for transformation (n = 1190 text responses) among a cross-section of this group (n = 372) and conducted quantitative and qualitative thematic analysis. Of the 58 identified priorities, most frequent were those regarding agroecological, organic and regenerative production, the localisation of food systems, reducing animal sourced foods and dietary change, and addressing power relations. Less frequent were those related to technology and innovation. We highlight potential positive and negative outcomes of these priorities and compare results with England’s Food Strategy White Paper and recommendations from global food systems reports. We close by offering a concrete set of 15 priorities for food systems transformation to be taken forward by policy and practice.
... In this stage, performance metrics were utilised to evaluate the MVR by comparing the clusters generated using the MVRs and single-view representations against the correct clusters [64,65]. Three common metrics for measuring clustering performance were used: normalised mutual information (NMI) [66], Adjusted Mutual Information (AMI) [67], and purity (P) [68]. ...
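For reference, these three metrics can be computed as in the sketch below; NMI and AMI come from scikit-learn, while purity is derived from the contingency matrix as the fraction of texts assigned to the majority true class of their cluster.

```python
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_mutual_info_score)
from sklearn.metrics.cluster import contingency_matrix

def purity(y_true, y_pred):
    """Fraction of texts whose cluster's majority class matches their class."""
    C = contingency_matrix(y_true, y_pred)   # rows: classes, cols: clusters
    return C.max(axis=0).sum() / C.sum()

y_true = [0, 0, 1, 1, 2, 2]   # gold labels (toy example)
y_pred = [0, 0, 1, 2, 2, 2]   # cluster ids

print(normalized_mutual_info_score(y_true, y_pred))  # NMI
print(adjusted_mutual_info_score(y_true, y_pred))    # AMI
print(purity(y_true, y_pred))                        # P
```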
Article
Full-text available
Clustering texts together is an essential task in data mining and information retrieval, whose aim is to group unlabeled texts into meaningful clusters that facilitate extracting and understanding useful information from large volumes of textual data. However, short text clustering (STC) is complex because short texts are typically sparse, ambiguous, noisy, and lacking in information. One of the challenges for STC is finding a proper representation for short text documents to generate cohesive clusters. Typically, however, STC considers only a single-view representation for clustering. The single-view representation is inefficient for representing text due to its inability to represent different aspects of the target text. In this paper, we propose the most suitable multi-view representation (MVR) (found as the best combination of different single-view representations) to enhance STC. Our work explores different types of MVR based on different sets of single-view representation combinations. The single-view representations are combined by fixed-length concatenation via the Principal Component Analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performance of various sets of MVRs on STC. Based on the experimental results, the best combination of single-view representations for STC was the 5-view MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). Based on that, we can conclude that MVR improves the performance of STC; however, the design of an MVR requires selective single-view representations.
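A minimal sketch of the concatenate-then-PCA construction described above; the per-view L2 normalization is our assumption (so that no single representation dominates the concatenation), and the output dimensionality is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def multi_view(views, dim=128):
    """Fixed-length multi-view representation via concatenation + PCA.
    views: list of (n_docs, d_i) matrices, e.g. BERT, GPT, TF-IDF,
    FastText, and GloVe features for the same documents."""
    normed = [v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-12)
              for v in views]
    X = np.hstack(normed)                          # concatenate along features
    return PCA(n_components=dim).fit_transform(X)  # dim <= n_docs required
```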
... has been used in the clustering of short texts from the data of title and annotation fields. This utility implements the Gibbs sampling algorithm for a Dirichlet Mixture Model [15]. ...
Preprint
Full-text available
Objectives. The aim of this study was to demonstrate the ability to visualize the results of the Scilit platform's bibliometric data analysis on the topic "AI & Machine Learning" to identify publications reflecting specific issues of the topic. Data source. Bibliometric records exported from the Scilit platform on the topic "AI & Machine Learning" for the years 2021–2023 were used. For each year, 6,000 records were downloaded in CSV and RIS format. Programs and utilities used. VOSviewer, Scimago Graphica, Inkscape, FP-growth utility, GSDMM algorithm. Services used: Elicit, QuillBot, Litmaps. Results. It has been shown that bibliometric data from the open-access abstract database Scilit can serve as a quality alternative to subscription-only databases. Data exported from the Scilit platform require preprocessing to make them available in a format that can be processed by programs such as VOSviewer and Scimago Graphica. The use of the GSDMM and FP-growth algorithms is effective for structuring bibliometric data for further visualization. The Scimago Graphica software provides wide possibilities for building compound diagrams, in particular for representing the keyword network along coordinates important for bibliometric analysis, such as average year of publication and average normalized citation, as well as for building an alluvial diagram of the co-occurrence of more than two keywords. The possibility of using services such as elicit.com, quillbot.com, and app.litmaps.com to accelerate the selection of publications on the topic under study is shown.
... However, a notable limitation of BTM lies in its reliance on only biterm interactions between terms, neglecting higher-order word co-occurrences. Yin and Wang developed a collapsed Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM) for short text clustering [28]. GSDMM can automatically determine the number of clusters and achieves high clustering performance by balancing the homogeneity and completeness criteria. ...
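Homogeneity (each cluster contains members of a single class) and completeness (all members of a class end up in the same cluster) can be checked directly with scikit-learn, as in this small example:

```python
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score)

y_true = [0, 0, 1, 1, 2, 2]   # gold classes (toy example)
y_pred = [0, 0, 1, 1, 1, 2]   # cluster ids

print(homogeneity_score(y_true, y_pred))   # does each cluster hold one class?
print(completeness_score(y_true, y_pred))  # does each class stay together?
print(v_measure_score(y_true, y_pred))     # harmonic mean of the two
```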
... Despite the efficiency of LDA on topic modeling for larger documents, BERTopic performs more efficiently for short text such as posts [10,29]. Therefore, we compare the performance of LDA with ChatGPT against BERTopic with ChatGPT for topic modeling and interpretation of our posts. ...
Chapter
Full-text available
Understanding the sentiment trends of large and unstructured text corpora is essential for various applications. Despite the extensive application of sentiment analysis and topic modeling, extracting meaningful insights from the vast amount of textual data generated on social media platforms presents unique challenges due to the short and noisy nature of the text. In this study, we propose a methodology for analyzing sentiment trends in social media, including data collection, data preprocessing, sentiment analysis, social network graph construction, and topic modeling interpretation using ChatGPT. By integrating ChatGPT with topic modeling techniques such as LDA and BERTopic, we aim to enhance the interpretability of sentiment-related topics and gain deeper insights into sentiment trends in social media conversations. Through a case study focusing on parental hesitancy toward child vaccination, we illustrate the applicability and utility of our proposed methodology in real-world social media analysis scenarios, demonstrating its effectiveness in topic modeling interpretation and in enhancing understanding of social media discourse. The integration of ChatGPT and BERTopic yielded improved topic interpretation for the short texts of a large corpus, based on the coherence score of the original posts and the generated description of the topic, ultimately reducing the cost and time required for topic interpretation by humans.
... GSDMM, in turn, is a model that assumes a single topic per document and was designed to perform well on short texts such as those found on social networks. It can infer the optimal number of topics; only an upper bound needs to be defined [Yin and Wang 2014]: it was set to 30, and 100 iterations were run. The model converged to 18 complaint topics, which were manually reduced to 9 according to the semantic similarity between them. ...
Conference Paper
A significant portion of the comments on the profiles of sports betting companies are complaints and problems reported by customers. This work tested supervised learning methods for classifying comments as complaints in data collected from Instagram, with the SVM model being selected. Topic modeling techniques were applied to the comments classified as complaints, and the GSDMM algorithm was chosen, making it possible to obtain the main problems reported by users. The selected models were implemented in an online prototype that allows new comments to be submitted and analyzed.
... Conducting a clustering analysis and evaluation of different algorithms (e.g., [40][41][42]) could help us improve the clusters produced by adopting a more efficient algorithm. Eventually, we will try to incorporate even larger academic datasets (tens or hundreds of millions of papers), and we may add a web crawling/scraping feature to discover and insert the most recently published papers into the system's database. ...
Article
Full-text available
Recommendation (recommender) systems (RS) have played a significant role in both research and industry in recent years. In the area of academia, there is a need to help researchers discover the most appropriate and relevant scientific information through recommendations. Nevertheless, we argue that there is a major gap between academic state-of-the-art RS and real-world problems. In this paper, we present a novel multi-staged RS based on clustering, graph modeling, and deep learning that manages to run on a full dataset (a scientific digital library) on the order of millions of users and items (papers). We run several tests (experiments/evaluations) to find the best approach for tuning our system; accordingly, we present and compare three versions of our RS with respect to recall and NDCG metrics. The results show that a multi-staged RS that utilizes a variety of techniques and algorithms is able to face real-world problems and large academic datasets. In this way, we suggest a way to close or minimize the gap between research and industrial RS.
... Another important aspect of topic modeling is its application to short documents. To address this, various methods have been proposed, such as Sentence-LDA [66], which models topics at the sentence level, and the Dirichlet Multinomial Mixture Model (DMM) [91], the Biterm topic model [88], and the Dirichlet Process Multinomial Mixture Model (DPMM) [66], which are specifically designed for short text topic modeling. ...
Article
Full-text available
Topic modeling aims to discover latent themes in collections of text documents. It has various applications across fields such as sociology, opinion analysis, and media studies. In such areas, it is essential to have easily interpretable, diverse, and coherent topics. An efficient topic modeling technique should accurately identify flat and hierarchical topics, especially useful in disciplines where topics can be logically arranged into a tree format. In this paper, we propose Community Topic, a novel algorithm that exploits word co-occurrence networks to mine communities and produces topics. We also evaluate the proposed approach using several metrics and compare it with usual baselines, confirming its good performances. Community Topic enables quick identification of flat topics and topic hierarchy, facilitating the on-demand exploration of sub- and super-topics. It also obtains good results on datasets in different languages.
... The study also analyzed responses on school bus transportation and children's academic performance during the pandemic using the Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM) topic modeling approach. GSDMM, also known as the Movie Group Process, is effective for shorter texts and converges rapidly (Mazarura and de Waal, 2016; Yin and Wang, 2014). The algorithm initially assigns documents to K groups at random and then reassigns them iteratively, favoring larger groups and groups whose documents share common topics or words. ...
... To group comments that tend to describe the same topics, there is a sub-field of machine learning called topic modelling, which is a form of clustering. In the literature on COVID-19, several topic modelling algorithms have been used, including: LIWC (Linguistic Inquiry and Word Count), a text analysis software that relies on dictionaries to cluster comments [22]; LDA (Latent Dirichlet Allocation), a probabilistic model used to determine the hidden topics in the comment set [23]; GSDMM (Gibbs Sampling Dirichlet Mixture Model), a Dirichlet mixture model used to discover hidden topics in a corpus [24]; and finally BERT (Bidirectional Encoder Representations from Transformers), a pre-trained natural language processing model based on the Transformer architecture [25]. Thus, Table 3 presents the different methods listed with their specific use cases. ...
Article
Full-text available
This study undertakes a thorough analysis of the sentiment within the r/Coronavirus subreddit community regarding COVID-19 vaccines on Reddit. We meticulously collected and processed 34,768 comments, spanning from November 20, 2020, to January 17, 2021, using sentiment calculation methods such as TextBlob and Twitter-RoBERTa-Base-sentiment to categorize comments into positive, negative, or neutral sentiments. The methodology involved the use of CountVectorizer as a vectorization technique and the implementation of advanced ensemble algorithms like XGBoost and Random Forest, achieving an accuracy of approximately 80%. Furthermore, through latent Dirichlet allocation, we identified 23 distinct reasons for vaccine distrust among negative comments. These findings are crucial for understanding the community's attitudes towards vaccination and can guide targeted public health messaging. Our study not only provides insights into public opinion during a critical health crisis, but also demonstrates the effectiveness of combining natural language processing tools and ensemble algorithms in sentiment analysis.
... where Mult is the multinomial distribution. A multinomial mixture model could be employed instead of a Dirichlet-multinomial mixture in applications where Dirichlet-multinomial mixtures are used, such as short text clustering [22], [43], [71] and clustering genetic/biological data [27], [29], [47]. The two models each have their own advantages and disadvantages. ...
Article
Full-text available
Recent work has shown that finite mixture models with m components are identifiable, while making no assumptions on the mixture components, so long as one has access to groups of samples of size 2m − 1 which are known to come from the same mixture component. In this work we generalize that result and show that, if every subset of k mixture components of a mixture model is linearly independent, then that mixture model is identifiable with only (2m − 1)/(k − 1) samples per group. We further show that this value cannot be improved. We prove an analogous result for a stronger form of identifiability known as “determinedness”, along with a corresponding lower bound. This independence assumption almost surely holds if mixture components are chosen randomly from a k-dimensional space. We describe some implications of our results for multinomial mixture models and topic modeling.
Article
Background. The access of Russian researchers to Scopus and Web of Science has become restricted, so the use of open reference databases has become relevant. Objective. Identification of topical problems of the energy transition in publications presented in Scilit, a content aggregator for scientific publications with free access. Materials and methods. The study utilized 10,121 bibliometric records of articles from 2019–2023. Publications were systematized using a Gibbs sampling algorithm for the Dirichlet mixture model. The topics of publications within the obtained clusters were analyzed using the demo version of the Carrot2 program. Publications were ranked using the sumy utility with the LexRank algorithm. Results. The identified topics address systemic problems of energy complexes, including the integration of different sources of energy generation, energy storage in “accumulators” or “green hydrogen”, and the optimization of their operation. Much attention is paid to the social aspects of the energy transition, which are especially relevant for rural areas and regions with a low level of economic development. Conclusions. Without financial support and appropriate infrastructure for local energy communities, the energy transition may be rejected by them. Households should be encouraged to use cleaner energy sources that are less harmful to health and the environment.
Article
Background Social media serves as a vast repository of data, offering insights into public perceptions and emotions surrounding significant societal issues. Amid the COVID-19 pandemic, long COVID (formally known as post–COVID-19 condition) has emerged as a chronic health condition, profoundly impacting numerous lives and livelihoods. Given the dynamic nature of long COVID and our evolving understanding of it, effectively capturing people’s sentiments and perceptions through social media becomes increasingly crucial. By harnessing the wealth of data available on social platforms, we can better track the evolving narrative surrounding long COVID and the collective efforts to address this pressing issue. Objective This study aimed to investigate people’s perceptions and sentiments around long COVID in Canada, the United States, and Europe, by analyzing English-language tweets from these regions using advanced topic modeling and sentiment analysis techniques. Understanding regional differences in public discourse can inform tailored public health strategies. Methods We analyzed long COVID–related tweets from 2021. Contextualized topic modeling was used to capture word meanings in context, providing coherent and semantically meaningful topics. Sentiment analysis was conducted in a zero-shot manner using Llama 2, a large language model, to classify tweets into positive, negative, or neutral sentiments. The results were interpreted in collaboration with public health experts, comparing the timelines of topics discussed across the 3 regions. This dual approach enabled a comprehensive understanding of the public discourse surrounding long COVID. We used metrics such as normalized pointwise mutual information for coherence and topic diversity for diversity to ensure robust topic modeling results. Results Topic modeling identified five main topics: (1) long COVID in people including children in the context of vaccination, (2) duration and suffering associated with long COVID, (3) persistent symptoms of long COVID, (4) the need for research on long COVID treatment, and (5) measuring long COVID symptoms. Significant concern was noted across all regions about the duration and suffering associated with long COVID, along with consistent discussions on persistent symptoms and calls for more research and better treatments. In particular, the topic of persistent symptoms was highly prevalent, reflecting ongoing challenges faced by individuals with long COVID. Sentiment analysis showed a mix of positive and negative sentiments, fluctuating with significant events and news related to long COVID. Conclusions Our study combines natural language processing techniques, including contextualized topic modeling and sentiment analysis, along with domain expert input, to provide detailed insights into public health monitoring and intervention. These findings highlight the importance of tracking public discourse on long COVID to inform public health strategies, address misinformation, and provide support to affected individuals. The use of social media analysis in understanding public health issues is underscored, emphasizing the role of emerging technologies in enhancing public health responses.
Article
Background To make the question text carry more information and to construct an end-to-end text clustering model, we propose double-target self-supervised clustering with multi-feature fusion (MF-DSC) for texts describing questions in the medical field. Since medical question-and-answer data are unstructured texts characterized by short length and irregular language use, features extracted by a single model cannot fully characterize the text content. Methods First, word weights were obtained based on term frequency, and word vectors were generated according to lexical semantic information. We then fused term frequency and lexical semantics to obtain weighted word vectors, which were used as input to the model for deep learning (a sketch of this fusion step is given below). Meanwhile, a self-attention mechanism was introduced to calculate the weight of each word in the question text, i.e., the interactions between words. To learn fused cross-document topic features and build an end-to-end text clustering model, two objective functions, L_cluster and L_topic, were constructed and integrated into a unified clustering framework, which also helped to learn a clustering-friendly representation. We then conducted comparison experiments with five other models to verify the effectiveness of MF-DSC. Results MF-DSC outperformed the other models, with normalized mutual information (NMI), adjusted Rand index (ARI), average clustering accuracy (ACC), and F1 of 0.4346, 0.4934, 0.8649, and 0.5737, respectively.
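A minimal sketch of the fusion idea only, not the authors' full MF-DSC model: weight each word's embedding by its term-frequency-based weight and sum, producing one vector per question text. The embeddings below are random stand-ins for real lexical-semantic vectors.

```python
# Sketch of term-frequency / embedding fusion (not the authors' exact model):
# each document vector is a TF-IDF-weighted average of its word embeddings.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["what causes chest pain", "chest pain after exercise"]
tfidf = TfidfVectorizer()
W = tfidf.fit_transform(docs)                  # doc x term weight matrix
vocab = tfidf.get_feature_names_out()

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in vocab}  # stand-in word vectors

doc_vecs = np.zeros((len(docs), 50))
for d in range(len(docs)):
    row = W.getrow(d).toarray().ravel()
    weights = row / (row.sum() + 1e-12)        # normalize TF-IDF weights
    doc_vecs[d] = sum(weights[j] * emb[vocab[j]] for j in range(len(vocab)))
```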
Article
Contemporary feminists utilize social media for activism, but backlashes follow. Gender-related discourses are often diminished when public events involving sexism and gender inequality are addressed on social media platforms. The dichotomous debate around the Tangshan beating incident in China epitomized how criminal interpretations of gender-related violence became a backlash against feminist expressions. By analyzing posts on Weibo using mixed methods, we describe the emerging discursive patterns around crime and gender, uncovering the gender-blind sexism that refutes feminist discourses on the platform. We also highlight the critical hurdles facing grassroots feminist activism in Chinese cyberspace and propose implications for design and research related to digital feminist activism.
Chapter
Meaning is a fundamental component of the social world (Luhmann, 1995; Schutz & Luckmann, 1973; Weber, 1946). People inhabiting the social world interpret the meaning of natural objects in their environment and of social objects, including other people, and act based on these interpretations. If we call the mechanism by which people interpret the meanings of objects and other people's actions and link them to their own actions the meaning-making mechanism (Lamont, 2000), then social science, which aims to explain the behavior of people and groups, must, as its fundamental task, elucidate this meaning-making mechanism.
Article
Full-text available
Clustering is a widely studied data mining problem in the text domain. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. In this chapter, we provide a detailed survey of the problem of text clustering. We study the key challenges of the clustering problem as it applies to the text domain, discuss the key methods used for text clustering and their relative advantages, and review a number of recent advances in the area in the context of social networks and linked data.
Article
Full-text available
Finding the appropriate number of clusters into which documents should be partitioned is crucial in document clustering. In this paper, we propose a novel approach, namely DPMFP, to discover the latent cluster structure based on the DPM model without requiring the number of clusters as input. Document features are automatically partitioned into two groups, namely discriminative words and irrelevant words, which contribute differently to document clustering. A variational inference algorithm is investigated to infer the document structure and the partition of document words at the same time. Our experiments indicate that the proposed approach performs well on both synthetic and real datasets. The comparison between our approach and state-of-the-art document clustering approaches shows that our approach is robust and effective for document clustering.
Conference Paper
Full-text available
http://haystack.lcs.mit.edu/papers/rennie.icml03.pdf
Conference Paper
Full-text available
One essential issue in document clustering is estimating the appropriate number of clusters into which a document collection should be partitioned. In this paper, we propose a novel approach, namely DPMFS, to address this issue. The proposed approach is designed 1) to group documents into a set of clusters whose number is determined automatically by the Dirichlet process mixture model; and 2) to identify the discriminative words and separate them from irrelevant noise words via a stochastic search variable selection technique. We explore the performance of our proposed approach on both a synthetic dataset and several realistic document datasets. The comparison between our proposed approach and state-of-the-art document clustering approaches indicates that our approach is robust and effective for document clustering.
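The Dirichlet process mixture prior that lets such models choose the number of clusters is commonly written via the Chinese restaurant process; in standard (not paper-specific) notation:

$$
p(z_i = k \mid \mathbf{z}_{-i}) =
\begin{cases}
\dfrac{n_k^{-i}}{n - 1 + \alpha}, & \text{for an existing cluster } k,\\[8pt]
\dfrac{\alpha}{n - 1 + \alpha}, & \text{for a new cluster,}
\end{cases}
$$

where $n_k^{-i}$ is the number of documents currently in cluster $k$ excluding document $i$, and $\alpha$ is the concentration parameter controlling how readily new clusters are created.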
Conference Paper
Full-text available
The Dirichlet compound multinomial (DCM) distribution, also called the multivariate Polya distribution, is a model for text documents that takes into account burstiness: the fact that if a word occurs once in a document, it is likely to occur repeatedly. We derive a new family of distributions that are approximations to DCM distributions and constitute an exponential family, unlike DCM distributions. We use these so-called EDCM distributions to obtain insights into the properties of DCM distributions, and then derive an algorithm for EDCM maximum-likelihood training that is many times faster than the corresponding method for DCM distributions. Next, we investigate expectation-maximization with EDCM components and deterministic annealing as a new clustering algorithm for documents. Experiments show that the new algorithm is competitive with the best methods in the literature, and superior from the point of view of finding models with low perplexity.
Conference Paper
Full-text available
Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show, using three standard document collections, that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
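For reference, the DCM (multivariate Polya) likelihood of a document with word-count vector $\mathbf{x}$ and length $n = \sum_w x_w$ has the standard form:

$$
p(\mathbf{x} \mid \boldsymbol{\alpha}) =
\frac{n!}{\prod_w x_w!}\;
\frac{\Gamma\!\bigl(\sum_w \alpha_w\bigr)}{\Gamma\!\bigl(\sum_w \alpha_w + n\bigr)}
\prod_w \frac{\Gamma(\alpha_w + x_w)}{\Gamma(\alpha_w)} .
$$

The parameters $\alpha_w$ are the model's extra degree of freedom for burstiness: small $\alpha_w$ makes repeated occurrences of the same word much more likely than under a multinomial.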
Conference Paper
Full-text available
Subscribers to popular news or blog feeds (RSS/Atom) often face information overload, as these feed sources usually deliver a large number of items periodically. One solution to this problem is clustering similar items in the feed reader to make the information more manageable for the user. Clustering items at the feed reader's end is a challenging task, as usually only a small part of the actual article is received through the feed. In this paper, we propose a method for improving the accuracy of clustering short texts by enriching their representation with additional features from Wikipedia. Empirical results indicate that this enriched representation can substantially improve clustering accuracy compared with the conventional bag-of-words representation.
Conference Paper
Full-text available
We compare various document clustering techniques, including K-means, an SVD-based method, and a graph-based approach, and evaluate their performance on short text data collected from Twitter. We define a measure for evaluating the cluster error with these techniques. Observations show that the graph-based approach using affinity propagation performs best in clustering short text data, with minimal cluster error.
Conference Paper
Full-text available
We present V-measure, an external entropy-based cluster evaluation measure. V-measure provides an elegant solution to many problems that affect previously defined cluster evaluation measures, including 1) dependence on the clustering algorithm or data set, 2) the "problem of matching", where only a portion of the data points' clustering is evaluated, and 3) accurate evaluation and combination of two desirable aspects of clustering, homogeneity and completeness. We compare V-measure to a number of popular cluster evaluation measures and demonstrate, using simulated clustering results, that it satisfies several desirable properties of clustering solutions. Finally, we use V-measure to evaluate two clustering tasks: document clustering and pitch accent type clustering.
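Concretely, with $H(\cdot)$ denoting entropy over class labels $C$ and cluster labels $K$, the measure combines homogeneity $h$ and completeness $c$:

$$
h = 1 - \frac{H(C \mid K)}{H(C)}, \qquad
c = 1 - \frac{H(K \mid C)}{H(K)}, \qquad
V_\beta = \frac{(1 + \beta)\, h\, c}{\beta\, h + c},
$$

so that $\beta = 1$ gives the harmonic mean of the two. scikit-learn exposes this as `v_measure_score` (and `homogeneity_completeness_v_measure` for all three values).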
Article
Full-text available
Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. We perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic based measures, we advocate the normalized information distance (NID) as a general measure of choice, for it possesses concurrently several important properties, such as being both a metric and a normalized measure, admitting an exact analytical adjusted-for-chance form, and using the nominal [0,1] range better than other normalized variants.
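The advocated measure has a compact standard form:

$$
\mathrm{NID}(U, V) = 1 - \frac{I(U; V)}{\max\{H(U),\, H(V)\}},
$$

where $I(U;V)$ is the mutual information between the two clusterings and $H(\cdot)$ their entropies; the max-normalization is what makes NID both a metric and bounded in $[0, 1]$.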
Article
The ever-increasing activity in social networks is mainly manifested by a growing stream of status updates or microblogging. This massive stream of updates emphasizes the need for accurate and efficient clustering of short messages on a large scale. Applying traditional clustering techniques is both inaccurate and inefficient due to sparseness. This paper presents an accurate and efficient algorithm for clustering Twitter tweets. We break the clustering task into two distinct stages: (1) batch clustering of user-annotated data, and (2) online clustering of a stream of tweets. In the first stage we rely on the habit of "tagging", common in social media streams (e.g., hashtags), so the algorithm can bootstrap on the tags to cluster a large pool of hashtagged tweets. The stable clusters achieved in the first stage lend themselves to online clustering of a stream of (mostly) tagless messages. We evaluate our results against a gold-standard classification and validate them by employing multiple clustering evaluation measures (information-theoretic, paired, F, and greedy). We compare our algorithm to a number of other clustering algorithms and various types of feature sets. Results show that the algorithm presented is both accurate and efficient and can easily be used for large-scale clustering of sparse messages, as the heavy lifting is performed on a sublinear number of documents.
Conference Paper
With the explosive growth of microblogging services, short text messages (also known as tweets) are being created and shared at an unprecedented rate. Tweets in their raw form can be incredibly informative, but also overwhelming. For both end users and data analysts it is a nightmare to plow through millions of tweets that contain enormous amounts of noise and redundancy. In this paper, we study continuous tweet summarization as a solution to this problem. While traditional document summarization methods focus on static, small-scale data, we aim to deal with dynamic, quickly arriving, large-scale tweet streams. We propose a novel prototype called Sumblr (SUMmarization By stream cLusteRing) for tweet streams. We first propose an online tweet stream clustering algorithm to cluster tweets and maintain distilled statistics called Tweet Cluster Vectors. Then we develop a TCV-Rank summarization technique for generating online summaries and historical summaries of arbitrary time durations. Finally, we describe a topic evolution detection method, which consumes online and historical summaries to produce timelines automatically from tweet streams. Our experiments on large-scale real tweets demonstrate the efficiency and effectiveness of our approach.
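A minimal sketch, loosely in the spirit of the paper's Tweet Cluster Vectors (all field names here are hypothetical, not the paper's): additive per-cluster statistics that can be updated in O(1) per tweet, which is what makes summaries over arbitrary time windows cheap.

```python
# Hypothetical streaming cluster-statistics record, loosely inspired by
# Tweet Cluster Vectors: everything is additive, so updates are O(1).
from dataclasses import dataclass, field

@dataclass
class ClusterStats:
    n: int = 0                                   # tweets absorbed so far
    tf: dict = field(default_factory=dict)       # summed term frequencies
    t_min: float = float("inf")                  # earliest timestamp seen
    t_max: float = float("-inf")                 # latest timestamp seen

    def add(self, tokens, ts):
        self.n += 1
        for tok in tokens:
            self.tf[tok] = self.tf.get(tok, 0) + 1
        self.t_min = min(self.t_min, ts)
        self.t_max = max(self.t_max, ts)
```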
Article
Presents parameter estimation methods common with discrete probability distributions, which are of particular interest in text modeling. Starting with maximum likelihood, maximum a posteriori, and Bayesian estimation, central concepts like conjugate distributions and Bayesian networks are reviewed. As an application, the model of latent Dirichlet allocation (LDA) is explained in detail, with a full derivation of an approximate inference algorithm based on Gibbs sampling, including a discussion of Dirichlet hyperparameter estimation. Finally, analysis methods for LDA models are discussed.
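The central result of that derivation is the collapsed Gibbs sampling update for the topic $z_i$ of token $i$ in document $m$:

$$
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n_{k,-i}^{(w_i)} + \beta}{n_{k,-i}^{(\cdot)} + V\beta}\;
\bigl(n_{m,-i}^{(k)} + \alpha\bigr),
$$

where $n_{k,-i}^{(w_i)}$ counts assignments of term $w_i$ to topic $k$, $n_{k,-i}^{(\cdot)}$ all tokens assigned to topic $k$, and $n_{m,-i}^{(k)}$ tokens in document $m$ assigned to topic $k$, all excluding the current token $i$; $V$ is the vocabulary size.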
Article
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
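A hard-EM simplification of this procedure is easy to sketch with scikit-learn; the paper's version uses soft posterior labels and a weighting factor for the unlabeled pool, both of which this sketch omits. Dense count matrices are assumed.

```python
# Hard-EM (self-training style) simplification of EM with naive Bayes.
# X_lab, X_unlab: dense document-term count matrices; y_lab: known labels.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_lab, y_lab, X_unlab, iters=10):
    clf = MultinomialNB().fit(X_lab, y_lab)      # train on labeled docs
    for _ in range(iters):
        y_hat = clf.predict(X_unlab)             # E-step, hardened to labels
        X_all = np.vstack([X_lab, X_unlab])      # M-step over all documents
        y_all = np.concatenate([y_lab, y_hat])
        clf = MultinomialNB().fit(X_all, y_all)
    return clf
```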
Article
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data and is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty in designing a general purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering.
Conference Paper
We present two modifications to the popular k-means clustering algorithm to address the extreme requirements for latency, scalability, and sparsity encountered in user-facing web applications. First, we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent. Second, we achieve sparsity with projected gradient descent, and give a fast ε-accurate projection onto the L1-ball. Source code is freely available: http://code.google.com/p/sofia-ml
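The first of the two modifications is available directly in scikit-learn; a minimal sketch with synthetic data and illustrative parameters (not values from the paper):

```python
# Mini-batch k-means on synthetic data; n_clusters and batch_size are
# illustrative choices.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.default_rng(0).normal(size=(10_000, 20))
km = MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=0).fit(X)
print(km.cluster_centers_.shape)  # (8, 20)
```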
Article
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
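A minimal usage sketch with scikit-learn's implementation of the model (toy corpus; parameters are illustrative):

```python
# Fitting LDA on a toy corpus and reading off per-document topic mixtures.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets fell sharply", "investors sold their shares"]
X = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # rows are documents, columns are topic proportions
```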
Article
In a document retrieval, or other pattern matching environment where stored entities (documents) are compared with each other or with incoming patterns (search requests), it appears that the best indexing (property) space is one where each entity lies as far away from the others as possible; in these circumstances the value of an indexing system may be expressible as a function of the density of the object space; in particular, retrieval performance may correlate inversely with space density. An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents. Typical evaluation results are shown, demonstrating the usefulness of the model.
Book
Class-tested and coherent, this textbook teaches classical and web information retrieval, including web search and the related areas of text classification and text clustering from basic concepts. It gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured in order to make teaching more natural and effective. Slides and additional exercises (with solutions for lecturers) are also available through the book's supporting website to help course instructors prepare their lectures.
Article
The problem of comparing two different partitions of a finite set of objects reappears continually in the clustering literature. We begin by reviewing a well-known measure of partition correspondence often attributed to Rand (1971), discuss the issue of correcting this index for chance, and note that a recent normalization strategy developed by Morey and Agresti (1984) and adopted by others (e.g., Milligan and Cooper 1985) is based on an incorrect assumption. Then, the general problem of comparing partitions is approached indirectly by assessing the congruence of two proximity matrices using a simple cross-product measure. They are generated from corresponding partitions using various scoring rules. Special cases derivable include traditionally familiar statistics and/or ones tailored to weight certain object pairs differentially. Finally, we propose a measure based on the comparison of object triples having the advantage of a probabilistic interpretation in addition to being corrected for chance (i.e., assuming a constant value under a reasonable null hypothesis) and bounded between ±1.
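The resulting chance-corrected index is usually written, for a contingency table $[n_{ij}]$ with row sums $a_i$ and column sums $b_j$ over $n$ objects, as:

$$
\mathrm{ARI} =
\frac{\sum_{ij} \binom{n_{ij}}{2} - \Bigl[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\Bigr] \Big/ \binom{n}{2}}
     {\frac{1}{2}\Bigl[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\Bigr] - \Bigl[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\Bigr] \Big/ \binom{n}{2}},
$$

which is 0 in expectation under random labelings and 1 for identical partitions.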
Article
Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.
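A minimal scikit-learn sketch of the method on toy two-blob data (the damping value is illustrative):

```python
# Affinity propagation on two synthetic blobs; exemplars are chosen
# automatically, so no cluster count is specified.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)),
               rng.normal(5.0, 0.5, (30, 2))])
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
print(len(ap.cluster_centers_indices_), "exemplars found")
```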
Article
Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification: it models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields, including statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types, which imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and have been successfully applied to real-life data mining problems; they are the subject of this survey.
Article
Recent approaches to text classification have used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption.
Article
There are numerous text documents available in electronic form. More and more are becoming available every day. Such documents represent a massive amount of information that is easily accessible. Seeking value in this huge collection requires organization; much of the work of organizing documents can be automated through text classification. The accuracy and our understanding of such systems greatly influences their usefulness. In this paper, we seek 1) to advance the understanding of commonly used text classification techniques, and 2) through that understanding, improve the tools that are available for text classification. We begin by clarifying the assumptions made in the derivation of Naive Bayes, noting basic properties and proposing ways for its extension and improvement. Next, we investigate the quality of Naive Bayes parameter estimates and their impact on classification. Our analysis leads to a theorem which gives an explanation for the improvements that can be found in multiclass classification with Naive Bayes using Error-Correcting Output Codes. We use experimental evidence on two commonly-used data sets to exhibit an application of the theorem. Finally, we show fundamental flaws in a commonly-used feature selection algorithm and develop a statistics-based framework for text feature selection. Greater understanding of Naive Bayes and the properties of text allows us to make better use of it in text classification.
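For reference, the multinomial naive Bayes decision rule analyzed in this line of work classifies a document with word counts $x_w$ as:

$$
\hat{c} = \arg\max_c \Bigl[\log P(c) + \sum_w x_w \log P(w \mid c)\Bigr],
$$

with the class-conditional word probabilities $P(w \mid c)$ typically smoothed (e.g., Laplace smoothing) to avoid zero estimates.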
Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, v. 31.
A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems.
J. Yin. Clustering microtext streams for event identification.
G. Heinrich. Parameter estimation for text analysis. Technical report, version 2.9, vsonix GmbH and University of Leipzig.