Article

Investigating task performance of probabilistic topic models: An empirical study of PLSA and LDA


Abstract

Probabilistic topic models have recently attracted much attention because of their successful applications in many text mining tasks such as retrieval, summarization, categorization, and clustering. Although many existing studies have reported promising performance of these topic models, none of the work has systematically investigated the task performance of topic models; as a result, some critical questions that may affect the performance of all applications of topic models are mostly unanswered, particularly how to choose between competing models, how multiple local maxima affect task performance, and how to set parameters in topic models. In this paper, we address these questions by conducting a systematic investigation of two representative probabilistic topic models, probabilistic latent semantic analysis (PLSA) and Latent Dirichlet Allocation (LDA), using three representative text mining tasks, including document clustering, text categorization, and ad-hoc retrieval. The analysis of our experimental results provides deeper understanding of topic models and many useful insights about how to optimize the performance of topic models for these typical tasks. The task-based evaluation framework is generalizable to other topic models in the family of either PLSA or LDA.


... Applications of topic modelling include document representation and collection summarization; it has not been widely used for collection partitioning, but note that we do not explore that problem here. Topic models have however been used to improve the performance of clustering in a variety of ways [9,22,33]. ...
... Combined methods. Lu et al. [22] explored an approach to integration of topic modelling with clustering. They compared the performance of two topic modelling methods, pLSA and LDA, in the context of document clustering, considering two ways in which topic modelling and clustering can interact. ...
... Contrasting these approaches, in the work of Lu et al. [22] clusters are identified by most significant topic and therefore the number of clusters is naturally equivalent to the number of topics. In the work of Xie and Xing [33], clusters are considered as mixtures of multiple local topics and global topics are mixtures of clusters. ...
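To make the two integration styles concrete, the sketch below (Python with NumPy and scikit-learn; the document-topic matrix theta is a synthetic stand-in for the output of a fitted PLSA/LDA model) contrasts cluster-by-most-significant-topic with ordinary clustering in the topic space:

```python
import numpy as np
from sklearn.cluster import KMeans

# theta: (n_docs, n_topics) per-document topic proportions, here drawn
# synthetically; in practice it comes from a fitted PLSA/LDA model.
rng = np.random.default_rng(0)
theta = rng.dirichlet(alpha=np.full(10, 0.1), size=500)

# Style 1 (Lu et al.): the most significant topic is the cluster label,
# so the number of clusters equals the number of topics.
clusters_by_topic = theta.argmax(axis=1)

# Style 2: treat topic proportions as a low-dimensional representation
# and run a conventional clustering algorithm on top of it.
clusters_by_kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(theta)
```

In the first style the number of clusters is tied to the number of topics, exactly as noted above; the second style decouples the two.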
... Topic coherence and perplexity scores are the main techniques used to estimate the number of topics. The α and β parameters were tested with a value recommended in the literature for short texts (50/K) (Lu et al., 2011; Steyvers & Griffiths, 2007). Another critical parameter of the LDA model, K, which reflects the number of topics, was also used in the model fitting. ...
... Another critical parameter of the LDA model, K, which reflects the number of topics, was also used in the model fitting. When the K parameter is increased, finer-grained topics are created, but when it is decreased, coarser-grained topics are created (Jelodar et al., 2019; Lu et al., 2011). To decide on the suitable or optimal number of topics (K), the LDA model was built and explored with varied K values (5, 6, 7, and 60) for the defined α and β values. ...
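A hedged sketch of such a sweep using Gensim (the tokenized corpus is a toy placeholder; note that Gensim exposes LDA's β prior under the name eta): fit LDA for several values of K with α = 50/K and compare coherence and perplexity.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy tokenized corpus; replace with real short texts.
texts = [["game", "learning", "student"], ["serious", "game", "design"]] * 50
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in (5, 6, 7, 10):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   alpha=50.0 / k,   # symmetric document-topic prior, 50/K
                   eta=0.01,         # topic-word prior (the beta above)
                   passes=10, random_state=0)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    print(k, coherence, lda.log_perplexity(corpus))
```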
Article
The literature of games in education has a rich and multidisciplinary content. Due to the large number of studies in the field, it is not easy to analyze all relevant studies, and few studies explore the big picture of research trends in the field. For this reason, the purpose of this study is to examine longitudinal trends of game-based research in education using text mining techniques. A dataset of 4980 publications indexed in the SCOPUS database between 1967 and mid-2021 was retrieved for analysis. The results include descriptive statistics of game-based research, trends of the research topics, and trends in the frequency of each topic over time. They show that the number of studies focusing on the use of games in education has increased, particularly since the 2000s, when internet use accelerated and became widespread. Approximately 70% of all the studies were conducted in the last 10 years. One third of the studies are related to the main topic of game-based learning. It is significant that in the last three decades the topic of serious games has been among the top three trends. In terms of topic usage acceleration, the highest values belong to game-based learning, serious games, and student science games, in that order. The findings of this study are expected to guide the field by providing a better understanding of the trends of games in education and offer a direction for future research.
... A wide range of approaches has been proposed for document clustering, including hierarchical methods [33], partitional methods [34], spectral methods [35], matrix factorization [36,37] and topic models [38]. ...
... Two seminal topic models [22,16] are exploited for document clustering in [38], where topics play the role of clusters and each text document is placed inside the most relevant cluster. ...
Article
Topic modeling can be unified synergically with document clustering. In this manuscript, we propose two innovative unsupervised approaches for the combined modeling and interrelated accomplishment of the two tasks. Both approaches rely on respective Bayesian generative models of topics, contents and clusters in textual corpora. Such models treat topics and clusters as linked latent factors in document wording. In particular, under the generative model of the second approach, textual documents are characterized by topic distributions, that are allowed to vary around the topic distributions of their membership clusters. Within the devised models, algorithms are designed to implement Rao-Blackwellized Gibbs sampling together with parameter estimation. These are derived mathematically for carrying out topic modeling with document clustering in a simultaneous and interrelated manner. A comparative empirical evaluation demonstrates the effectiveness of the presented approaches, over different families of state-of-the-art competitors, in clustering real-world benchmark text collections and, also, uncovering their underlying semantics. Besides, a case study is developed as an insightful qualitative analysis of results on real-world text corpora.
... For topic modeling studies based on textual content analysis, the creation of an empirical corpus is usually one of the most critical steps [29], [30]. The selection of the approach to be applied in the corpus creation procedure usually has a direct influence on the results [28], [29]. ...
... Moreover, the LDA algorithm provides many efficient methods for calculating the coherence score to estimate the optimal number of topics and is widely applied in many fields [34], [43]. For these reasons, LDA is a highly preferred and accepted algorithm for the semantic content analysis of large textual corpora because of the systematic approaches it provides [28], [30], [39]. ...
Article
Full-text available
Bioinformatics, which has developed rapidly in recent years with the collaborative contributions of the fields of biology and informatics, provides a deeper perspective on the analysis and understanding of complex biological data. In this regard, bioinformatics has an interdisciplinary background and a rich literature in terms of domain-specific studies. Providing a holistic picture of bioinformatics research by analyzing the major topics and their trends and developmental stages is critical for an understanding of the field. From this perspective, this study aimed to analyze the last 50 years of bioinformatics studies (a total of 71,490 articles) by using an automated text-mining methodology based on probabilistic topic modeling to reveal the main topics, trends, and the evolution of the field. As a result, 24 major topics that reflect the focuses and trends of the field were identified. Based on the discovered topics and their temporal tendencies from 1970 until 2020, the developmental periods of the field were divided into seven phases, from the “newborn” to the “wisdom” stages. Moreover, the findings indicated a recent increase in the popularity of the topics “Statistical Estimation”, “Data Analysis Tools”, “Genomic Data”, “Gene Expression”, and “Prediction”. The results of the study revealed that, in bioinformatics studies, interest in innovative computing and data analysis methods based on artificial intelligence and machine learning has gradually increased, thereby marking a significant improvement in contemporary analysis tools and techniques based on prediction.
... Topic models have been successfully applied in Natural Language Processing with various applications such as information extraction, text clustering, summarization, and sentiment analysis [1][2][3][4][5][6]. The most popular conventional topic model, Latent Dirichlet Allocation [7], learns document-topic and topic-word distribution via Gibbs sampling and mean field approximation. ...
... Our goal is to learn a mapping function f_θ : ℝ^V → ℝ^T of the encoder θ which transforms x to the latent distribution z (x⁻ and x⁺ are transformed to z⁻ and z⁺, respectively). A reasonable mapping function must fulfill two qualities: (1) x and x⁺ are mapped onto nearby positions; (2) x and x⁻ are projected distantly. Regarding goal (1) as the main objective and goal (2) as the constraint enforcing the model to learn the relations among dissimilar samples, we specify a constrained optimization problem in which a coefficient denotes the strength of the constraint. ...
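The excerpt above lost the symbol for the constraint strength; a common reading is a Lagrangian-style relaxation of goal (2). The NumPy sketch below expresses such a triplet objective (the function name, the hinge form, and the weight lam are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def triplet_topic_loss(z, z_pos, z_neg, lam=1.0, margin=1.0):
    """Main objective: pull z toward z_pos; constraint term with weight
    `lam`: push z at least `margin` away from z_neg (hinge relaxation)."""
    pull = np.sum((z - z_pos) ** 2)
    push = np.maximum(0.0, margin - np.sum((z - z_neg) ** 2))
    return pull + lam * push
```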
Preprint
Full-text available
Recent empirical studies show that adversarial topic models (ATM) can successfully capture semantic patterns of a document by differentiating it from another, dissimilar sample. However, that discriminative-generative architecture has two important drawbacks: (1) the architecture does not relate similar documents, which share the same document-word distribution of salient words; (2) it restricts the ability to integrate external information, such as the sentiment of the document, which has been shown to benefit the training of neural topic models. To address those issues, we revisit the adversarial topic architecture from the viewpoint of mathematical analysis, propose a novel approach to re-formulate the discriminative goal as an optimization problem, and design a novel sampling method which facilitates the integration of external variables. The reformulation encourages the model to incorporate the relations among similar samples and enforces the constraint on the similarity among dissimilar ones, while the sampling method, which is based on the internal input and reconstructed output, helps inform the model of salient words contributing to the main topic. Experimental results show that our framework outperforms other state-of-the-art neural topic models in topic coherence on three common benchmark datasets spanning various domains, vocabulary sizes, and document lengths.
... Moreover, it is possible to employ the fitted model to analyze a new document/text [4]. Also, LDA is useful for large-scale corpora, making it suitable for exploratory literature reviews over large bodies of research articles [8][11]. ...
... The paper used LDA and its Mallet wrapper to build the models. The inherent advantage of LDA lies in its capability to handle large-scale corpora efficiently [11]. Since management research is interdisciplinary in nature, a full-fledged, well-defined literature review requires analysis of many research contributions. ...
... To understand the topics inside the bullet chats, we perform topic modeling for each platform. Following a previous study (Lu, Mei, and Zhai 2011), we randomly sample 5% of the bullet chats and leverage BERTopic (Grootendorst 2022) to extract topics from them with the default settings. The model produces 5,119 and 2,872 topics for Bilibili and Huya, respectively. ...
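A sketch of that sampling-plus-BERTopic pipeline (the chat list is a tiny synthetic placeholder; real runs need a sizable, varied corpus for the underlying embedding and clustering steps to behave):

```python
import random
from bertopic import BERTopic

# Assumed input: the full list of bullet-chat strings for one platform.
bullet_chats = ["nice play", "that team is unstoppable", "gg", "so toxic"] * 2000

random.seed(0)
sample = random.sample(bullet_chats, k=len(bullet_chats) // 20)  # 5% random sample

topic_model = BERTopic()                    # default settings, as in the study
topics, probs = topic_model.fit_transform(sample)
print(topic_model.get_topic_info())         # one row per extracted topic
```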
Article
Esports, short for electronic sports, is a form of competition using video games and has attracted an audience of more than 530 million worldwide. To watch esports, people use online livestreaming platforms. Recently, a novel interaction method, namely "bullet chats," has been introduced on these platforms. Different from conventional comments, bullet chats are scrolling comments posted by audiences and synchronized to the livestreaming timeline, enabling audiences to share and communicate their immediate perspectives. The real-time nature of bullet chats therefore brings a new perspective to esports analysis. In this paper, we conduct the first empirical study on bullet chats for esports, focusing on one of the most popular video games, League of Legends (LoL). Specifically, we collect 21 million bullet chats of LoL from Jan. 2023 to Mar. 2023 across two mainstream platforms (Bilibili and Huya). By performing quantitative analysis, we reveal how the quantity and toxicity of bullet chats are distributed (and change) w.r.t. three aspects: the season, the team, and the match. Our findings show that teams with higher rankings tend to attract a greater quantity of bullet chats, and these chats are often characterized by a higher degree of toxicity. We then utilize topic modeling to identify topics among bullet chats. Interestingly, we find that a considerable portion of topics (14.14% on Bilibili and 22.94% on Huya) discuss themes beyond the game, including genders, entertainment stars, non-esports athletes, and so on. Besides, by further modeling topics on toxic bullet chats, we find hateful speech targeting different social groups, ranging from professions to regions. To the best of our knowledge, this work is the first measurement of bullet chats on esports livestreaming. We believe our study can shed light on esports research from the perspective of bullet chats.
... Text clustering is an important problem in the field of natural language processing [36]. Clustering short texts is challenging due to the extremely low amount of word co-occurrence within such texts [20]; this poses difficulties when employing traditional topic modeling algorithms, such as probabilistic latent semantic analysis (PLSA) and Latent Dirichlet Allocation (LDA) [37], as these algorithms were primarily designed for long texts [5]. ...
Article
Full-text available
Topic modeling methods have proved effective for inferring latent topics from short texts. Dealing with short texts is challenging yet helpful for many real-world applications, due to the sparse terms in the text and the high-dimensionality representation. Most topic modeling methods require the number of topics to be defined beforehand. Similarly, methods based on the Dirichlet Multinomial Mixture (DMM) require the maximum possible number of topics to be specified before execution, which is hard to determine due to topic uncertainty and noise in the dataset. Hence, a new approach called the Topic Clustering algorithm based on Levenshtein Distance (TCLD) is introduced in this paper. TCLD combines DMM models and a fuzzy matching algorithm to address two key challenges in topic modeling: (a) the outlier problem in topic modeling methods, and (b) the problem of determining the optimal number of topics. TCLD uses the initial clustered topics generated by DMM models and then evaluates the semantic relationships between documents using Levenshtein Distance. Subsequently, it determines whether to keep a document in the same cluster, relocate it to another cluster, or mark it as an outlier. The results demonstrate the efficiency of the proposed approach across six English benchmark datasets, in comparison to seven topic modeling approaches, with 83% improvement in purity and 67% enhancement in Normalized Mutual Information (NMI) across all datasets. The proposed method was also applied to a collected set of Arabic tweets; according to human inspection, only 12% of the Arabic short texts were incorrectly clustered.
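For reference, the edit distance at the heart of TCLD can be computed with the classic dynamic program; below is a plain implementation (how TCLD then thresholds these distances to relocate documents or flag outliers follows the paper's procedure, which is not reproduced here):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```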
... Topic modeling, a popular text-mining approach, can analyze millions of documents in minutes (Karami 2020). LDA is a statistical distribution-based topic model (Lu et al., 2011). LDA presupposes the exchangeability of words and documents in a bag-of-words corpus. ...
Article
The recent advent of ChatGPT has stirred substantial attention and debates, potentially altering the dynamics across various industries, notably marketing. This pioneering study delves into the public reactions and applications of ChatGPT within marketing realms. Leveraging a text-mining methodology, a corpus of over 600,000 tweets harvested before and after ChatGPT's launch, from January 2021 to April 2023, was scrutinized to gauge public sentiment towards AI-incorporated tools in marketing and to unearth the predominant themes within public discourse. Initial findings unveiled a buoyant public sentiment towards AI-facilitated tools, which, however, ebbed in January 2023, driven by apprehensions regarding AI technology's limitations and potential perils. Subsequent months witnessed a rebound in sentiment, stabilizing above the positive threshold, as Twitter users increasingly acknowledged ChatGPT's prospective merits for employment and daily life. However, compared to the period before the introduction of ChatGPT, there has been a decline in the general public's sentiment towards AI in marketing. Furthermore, the analysis discerned a convergence in the core topics broached by the public concerning AI's and ChatGPT's ramifications on marketing. While the automation of mundane tasks and heightened customer experience were lauded, trepidations surrounding job displacement and the ethical quandaries of supplanting human labor with machines surfaced. This exposition recommends that enterprises meticulously assess the prospective impact of AI on their personnel, advocating for the judicious and ethical deployment of such emergent technologies.
... Based on the available research, the LDA model has proven superior to the pLSA model in some respects. First, the LDA approach can mitigate overfitting problems and compute scalable, fine-grained, low-dimensional semantic representations [21]. Second, LDA better captures the exchangeability of words and documents in the mixture model [22]. ...
Article
Full-text available
We quantify the impact of technological innovation factors on university patent transferability, accurately identify transferable patents, and address the lack of interpretability in existing patent transferability models by applying the latent Dirichlet allocation (LDA) model to conduct text mining and feature extraction on abstracts of university patents in the field of artificial intelligence. We then construct a patent transferability fusion index system that includes technological innovation features and quality features. Four typical machine learning algorithms, namely support vector machine (SVM), random forest (RF), artificial neural network (ANN), and extreme gradient boosting (XGBoost) were used to predict university patent transferability. We use SHapley Additive exPlanations (SHAP) to explore feature importance and interactions based on the model with the strongest performance. Our results show that (1) XGBoost outperforms the other algorithms in predicting university patent transferability; (2) fusion indicators can effectively improve prediction performance with respect to university patent transferability; (3) the importance of technological innovation features generated with XGBoost is generally high; and (4) the impact of both technology innovation and patent quality features on university patent transferability is nonlinear and there are significant positive interaction effects between them.
... As a text mining technique, LDA [34] has been widely used as a probabilistic model in the analysis of a corpus of documents [61]. During the modeling process in LDA, this technique assigns high probabilities to similar documents and members within the corpus [62]. ...
Article
Full-text available
Customer Relationship Management (CRM) is a method of management that aims to establish, develop, and improve relationships with targeted customers in order to maximize corporate profitability and customer value. There are many CRM systems on the market. These systems are developed based on a combination of business requirements, customer needs, and industry best practices. The impact of CRM systems on customer satisfaction and competitive advantage, as well as their tangible and intangible benefits, has been widely investigated in previous studies. However, there is a lack of studies assessing the quality dimensions of these systems against an organization's CRM strategy. This study investigates customer satisfaction with CRM systems through online reviews. We collected 5172 online customer reviews of 8 CRM systems from the Google Play store platform. The satisfaction factors were extracted using Latent Dirichlet Allocation (LDA) and grouped into three dimensions: information quality, system quality, and service quality. Data segmentation was performed using Learning Vector Quantization (LVQ), and feature selection was performed with the entropy-weight approach. We then used the Adaptive Neuro-Fuzzy Inference System (ANFIS), a hybrid of fuzzy logic and neural networks, to assess the relationship between these dimensions and customer satisfaction. The results are discussed and research implications are provided.
... Due to these advantages, LDA stands as a highly favored and widely accepted algorithm for conducting semantic content analysis on extensive textual corpora. Its systematic approach and versatility have made it a staple choice in numerous research endeavors [53], [54]. ...
Article
Full-text available
Gamification holds significant importance as an efficacious means to motivate individuals, stimulate their engagement, and foster desired behaviors. There is an increasing interest among researchers in exploring the domain of gamification. Consequently, it becomes crucial to identify specific research trends within this field. This study employs a comprehensive analysis of 4743 articles sourced from the Scopus database, utilizing the topic modeling approach, with the objective of discerning research patterns and trends within the gamification domain. The findings revealed the existence of thirteen distinct topics within the field. Notably, "Health training," "Enhancing learning with technology," and "Game design framework" emerged as the most prominent topics, based on their frequency of research publications and popularity. This study serves as a valuable resource for researchers and practitioners seeking to stay abreast of the latest advancements in gamification. The identified issues through topic modeling can be employed to identify gaps in current research and potential directions for future research endeavors.
... Topic models have been widely used to identify human-interpretable topics and learn text representations, which have been applied to various tasks in Natural Language Processing (NLP) such as information retrieval (Lu et al., 2011), summarization, and semantic similarity detection (Peinelt et al., 2020). A typical topic model is based on latent Dirichlet allocation (LDA) (Blei et al., 2003) and Bayesian inference. ...
... Yi et al. [52] conducted a comparative study of various topic modeling methods, including LDA, PLSA, and LSI, and found that LDA generally performed best. Similarly, Lu et al. [53] compared LDA and PLSA in an empirical study, and their results also favored LDA for most tasks. Atreya and Elkan [54] demonstrated the limitations of Latent Semantic Indexing (LSI) for TREC collections and proposed a new retrieval model that utilizes word co-occurrence statistics to estimate document similarity. ...
Article
Full-text available
This paper provides an extensive and thorough overview of the models and techniques utilized in the first and second stages of the typical information retrieval processing chain. Our discussion encompasses the current state-of-the-art models, covering a wide range of methods and approaches in the field of information retrieval. We delve into the historical development of these models, analyze the key advancements and breakthroughs, and address the challenges and limitations faced by researchers and practitioners in the domain. By offering a comprehensive understanding of the field, this survey is a valuable resource for researchers, practitioners, and newcomers to the information retrieval domain, fostering knowledge growth, innovation, and the development of novel ideas and techniques.
... The main advantage of LDA is the ability to generate new documents. LDA is also considered to be more robust to overfitting, although there is no definitive clarity on this issue (Lu, 2011). Most topic models today are based on LDA. ...
... It was not the first method for topic modelling, but since the introduction of LDA most models are adaptations of it [9]. Examples of recent advances in topic modelling include a variety of extensions of LDA such as neural models that were developed to facilitate LDA's robustness in retrieval tasks [6,17,19,30,33]. ...
Conference Paper
Full-text available
Topic modelling is an approach to generation of descriptions of document collections as a set of topics where each has a distinct theme and documents are a blend of topics. It has been applied to retrieval in a range of ways, but there has been little prior work on measurement of whether the topics are descriptive in this context. Moreover, existing methods for assessment of topic quality do not consider how well individual documents are described. To address this issue we propose a new measure of topic quality, which we call specificity; the basis of this measure is the extent to which individual documents are described by a limited number of topics. We also propose a new experimental protocol for validating topic-quality measures, a 'noise dial' that quantifies the extent to which the measure's scores are altered as the topics are degraded by addition of noise. The principle of the mechanism is that a meaningful measure should produce low scores if the 'topics' are essentially random. We show that specificity is at least as effective as existing measures of topic quality and does not require external resources. While other measures relate only to topics, not to documents, we further show that specificity correlates to the extent to which topic models are informative in the retrieval process.
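The excerpt does not give specificity's formula; one plausible proxy for "documents described by a limited number of topics" is the average topic mass captured by each document's top-m topics. The sketch below is an illustrative assumption, not the authors' definition:

```python
import numpy as np

def specificity_proxy(theta, m=3):
    """Mean share of each document's topic mass held by its top-m topics.
    High values mean documents are described by few topics.
    NOTE: an illustrative proxy, not the measure defined in the paper."""
    top_m = np.sort(theta, axis=1)[:, -m:]
    return float(top_m.sum(axis=1).mean())

rng = np.random.default_rng(0)
peaked = rng.dirichlet(np.full(20, 0.1), size=100)   # concentrated mixtures
flat = rng.dirichlet(np.full(20, 10.0), size=100)    # near-uniform mixtures
print(specificity_proxy(peaked), specificity_proxy(flat))
```

A noise dial in this setting would progressively randomize the model and check that the score degrades accordingly.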
... Topic models are often used by researchers to improve text representation and indexing in first-stage retrieval systems [1,97,100,101]. For example, [95] introduces the concept of a generalized vector space model. More recently, various topic models have been applied to first-stage retrieval. ...
Conference Paper
In this paper, first-stage retrieval technology is studied from four aspects: the development background, the frontier technology, the current challenges, and the future directions. Our contribution consists of two main parts. On the one hand, this paper reviews some retrieval techniques proposed by researchers and draws targeted conclusions through comparative analysis. On the other hand, different research directions are discussed, and the impact of the combination of different techniques on first-stage retrieval is studied and compared. In this way, this survey provides a comprehensive overview of the field and will hopefully be used by researchers and practitioners in the first-stage retrieval domain, inspiring new ideas and further developments.
... Probabilistic Latent Semantic Analysis (pLSA) was proposed to address the representational limitations of LSA by substituting Singular Value Decomposition (SVD) with a probabilistic model [13]. It represents every entry in the TF-IDF matrix using a probability. ...
Article
Full-text available
Topic modeling is a powerful technique for uncovering hidden patterns in large document collections. It can identify themes that are highly connected and lead to a certain region while accounting for temporal and spatial complexity. In addition, sentiment analysis can determine the sentiments of media articles on various issues. This study proposes a two-stage natural language processing-based model that utilizes Latent Dirichlet Allocation to identify critical topics related to each type of legal case or judgment, and the Valence Aware Dictionary and sEntiment Reasoner (VADER) algorithm to assess people's sentiments on those topics. By applying these strategies, this research aims to gauge public perception of controversial legal issues. This study is the first of its kind to use topic modeling and sentiment analysis on Indian legal documents and paves the way for a better understanding of legal documents.
... [55] compares the performance of various topic models for information retrieval, including LDA and PLSA. [56] provides a comparison of the performance of PLSA and LDA on several benchmark datasets. However, not all studies have found topic models to be effective for information retrieval. ...
Preprint
In this paper, we provide a detailed overview of the models used for information retrieval in the first and second stages of the typical processing chain. We discuss the current state-of-the-art models, including methods based on terms, semantic retrieval, and neural approaches. Additionally, we delve into the key topics related to the learning process of these models. This way, this survey offers a comprehensive understanding of the field and is of interest for researchers and practitioners entering or working in the information retrieval domain.
... Depending on the generative procedure and statistical assumptions employed in the topic modeling algorithm, both P(w|a_k) and P(a_k|r) may change. Among the various topic modeling algorithms, pLSA (Hofmann, 2001) and LDA (Blei et al., 2003) are widely utilized in the literature (Lu et al., 2011). Therefore, we use these algorithms for identifying the topics and the associated keywords. ...
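With a fitted LDA model, both distributions are directly available; a Gensim sketch (toy corpus; here each aspect a_k is simply a topic):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Placeholder reviews; in the study these would be OCR sentences.
texts = [["seat", "comfort", "legroom"], ["crew", "service", "friendly"]] * 40
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4, random_state=0)

# P(w | a_k): one row per aspect/topic over the whole vocabulary.
topic_word = lda.get_topics()          # shape: (num_topics, vocab_size)

# P(a_k | r): topic proportions for a single review r.
doc_topics = lda.get_document_topics(corpus[0], minimum_probability=0.0)
```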
Article
Full-text available
Driven by the fierce competition in the airline industry, carriers strive to increase their customer satisfaction by understanding their expectations and tailoring their service offerings. Due to the explosive growth of social media usage, airlines have the opportunity to capitalize on the abundantly available online customer reviews (OCR) to extract key insights about their services and competitors. However, the analysis of such unstructured textual data is complex and time-consuming. This research aims to automatically and efficiently extract airline-specific intelligence (i.e., passenger-perceived strengths and weaknesses) from OCR. Topic modeling algorithms are employed to discover the prominent service quality aspects discussed in the OCR. Likewise, sentiment analysis methods and collocation analysis are used to classify review sentence sentiment and ascertain the major reasons for passenger satisfaction/dissatisfaction, respectively. Subsequently, an ensemble-assisted topic model (EA-TM) and sentiment analyzer (E-SA) is proposed to classify each review sentence to the most representative aspect and sentiment. A case study involving 398,571 airline review sentences of a US-based target carrier and four of its competitors is used to validate the proposed framework. The proposed EA-TM and E-SA achieved 17–23% and 9–20% higher classification accuracy over individual benchmark models, respectively. The results reveal 11 different aspects of airline service quality from the OCR, airline-specific sentiment summary towards each aspect, and root causes for passenger satisfaction/dissatisfaction for each identified topic. Finally, several theoretical and managerial implications for improving airline services are derived based on the results.
... They focused on which component most closely corresponds to the terms describing its function in the issue report. Therefore, they constructed a semantic analysis model with issue reports as documents and components as categories, as Lu et al. [25] proposed. The experiment reported top-k results on the title-and-description dataset, consisting of 6,000 issue reports for ten components. ...
Article
Full-text available
Various issues or bugs are reported during software development. It takes considerable effort, time, and cost for software developers to triage these issues manually. Many previous studies have proposed various methods to automate the triage process by predicting components using word-based language models. However, these methods still suffer from unsatisfactory performance due to their structural limitations and ignorance of word context. In this paper, we propose a novel technique based on pretrained language models that aims to predict the component of an issue report. Our approach fine-tunes pretrained language models to conduct multilabel classification. The proposed approach outperforms the previous state-of-the-art method by more than 30% with respect to recall at k on all the datasets considered in our experiment. This improvement suggests that fine-tuned pretrained language models can help us predict issue components effectively.
... Those methods transform unstructured text into numerical data, so that large-scale text data can be structured, classified, or clustered automatically. Although such representations are feasible for tasks such as document summarization [8,9], document classification [10][11][12] and clustering [13], they suffer from dimensional sparsity, which leads to high computational cost, and also miss contextual information in the text sequences [14]. ...
Article
Full-text available
Neural networks, primarily recurrent and convolutional neural networks, have proven successful in text classification. However, convolutional models can be limited when classification tasks are determined by long-range semantic dependency, and while recurrent models can capture long-range dependency, their sequential architecture can constrain training speed. Meanwhile, traditional networks encode the entire document in a single pass, which omits the hierarchical structure of the document. To address the above issues, this study presents T-HMAN, a Topic-aware Hierarchical Multiple Attention Network for text classification. A multi-head self-attention mechanism coupled with convolutional filters is developed to capture long-range dependency by integrating the convolution features from each attention head. Meanwhile, T-HMAN combines topic distributions generated by Latent Dirichlet Allocation (LDA) with sentence-level and document-level inputs, respectively, in a hierarchical architecture. The proposed model surpasses the accuracies of the current state-of-the-art hierarchical models on five publicly accessible datasets. The ablation study demonstrates that the involvement of multiple attention mechanisms brings significant improvement. The current topic distributions are fixed vectors generated by LDA; in future work, the topic distributions will be parameterized and updated simultaneously with the model weights.
... Analyzing over one million tweets would have required a substantial amount of human effort. Computationally, LDA performs the process exponentially faster while addressing issues of sparsity related to text mining [77]. ...
Article
Full-text available
Natural language processing techniques have increased the volume and variety of text data that can be analyzed. The aim of this study was to identify the positive and negative topical sentiments among diet, diabetes, exercise, and obesity tweets. Using a sequential explanatory mixed-method design for our analytical framework, we analyzed a data corpus of 1.7 million diet, diabetes, exercise, and obesity (DDEO)-related tweets collected over 12 months. Sentiment analysis and topic modeling were used to analyze the data. The results show that overall, 29% of the tweets were positive, and 17% were negative. Using sentiment analysis and latent Dirichlet allocation (LDA) topic modeling, we analyzed 800 positive and negative DDEO topics. From the 800 LDA topics—after the qualitative and computational removal of incoherent topics—473 topics were characterized as coherent. Obesity was the only query health topic with a higher percentage of negative tweets. The use of social media by public health practitioners should focus not only on the dissemination of health information based on the topics discovered but also consider what they can do for the health consumer as a result of the interaction in digital spaces such as social media. Future studies will benefit from using multiclass sentiment analysis methods associated with other novel topic modeling approaches.
... LDA models the original text using the bag-of-words model, obtaining a probability distribution of text-implicit topics and characterizing the text with a vector of text in the implicit topic space, which produces better results for long texts. Lu et al. [6][7][8] improve the LDA model to enhance the topic model's textual representation. Muhammad et al. [9] develop sentiment shift metrics based on contextual information such as negation, great celebration, and decay, as well as text features such as unigrams and bigrams. ...
Article
Full-text available
Emotional tracking on time-varying virtual space communication aims to identify sentiments and opinions expressed in a piece of user-generated content. However, the existing research mainly focuses on the user’s single post, despite the fact that social network data are sequential. In this article, we propose a sentiment analysis model based on time series prediction in order to understand and master the chronological evolution of the user’s point of view. Specifically, with the help of a domain-knowledge-enhanced pre-trained encoder, the model embeds tokens for each moment in the text sequence. We then propose an attention-based temporal prediction model to extract rich timing information from historical posting records, which improves the prediction of the user’s current state and personalizes the analysis of user’s sentiment changes in social networks. The experiments show that the proposed model improves on four kinds of sentiment tasks and significantly outperforms the strong baseline.
... Every topic is described as a multinomial distribution over the |V| words in the vocabulary. The texts are created by sampling a mixture of these latent topics and then sampling words from that mixture [31,32]. ...
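That generative story is short enough to write out directly; a NumPy sketch (vocabulary size, topic count, and priors are arbitrary placeholder values):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, alpha, beta = 1000, 20, 0.1, 0.01         # |V|, topics, priors

phi = rng.dirichlet(np.full(V, beta), size=K)   # K multinomials over the |V| words

def generate_document(length: int) -> list[int]:
    theta = rng.dirichlet(np.full(K, alpha))    # the document's topic mixture
    words = []
    for _ in range(length):
        z = rng.choice(K, p=theta)              # sample a latent topic...
        words.append(rng.choice(V, p=phi[z]))   # ...then a word from that topic
    return words

doc = generate_document(50)
```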
Article
Full-text available
Text mining, also known as text analysis, is the process of converting unstructured text data into meaningful and functional information. Text mining uses different AI technologies to automate data processing and generate valuable insights, allowing enterprises to make data-based decisions. It enables the user to extract important content from text data sets and encourages machine-learning applications in research areas such as medical and pharmaceutical innovation. Apart from this, text analysis converts inaccessible data into a structured format that can be used for further analysis, emphasizing facts and relationships in large data sets; this information is extracted and converted into structured data for visualization, analysis, and integration, and refined using machine-learning methods. Like most things related to Natural Language Processing, text mining can seem like a difficult concept to understand, but it does not have to be. This research article goes through the basics of text mining, clarifies its different methods and techniques, and makes it easier to understand how it works. We implemented Latent Dirichlet Allocation to mine the data set; it works properly and will be extended with further data mining techniques in future work.
... Our models use mixtures of Gaussians to model local patches, and the codebook is learned online. LDA (Lu, Mei, and Zhai 2011) can be used for clustering by treating each topic as a cluster. An image is assigned to cluster x if x = argmax_k θ_k, where θ is the topic proportion vector of the image. ...
Article
Image clustering and visual codebook learning are two fundamental problems in computer vision and they are tightly related. On one hand, a good codebook can generate effective feature representations which largely affect clustering performance. On the other hand, class labels obtained from image clustering can serve as supervised information to guide codebook learning. Traditionally, these two processes are conducted separately and their correlation is generally ignored. In this paper, we propose a Double Layer Gaussian Mixture Model (DLGMM) to simultaneously perform image clustering and codebook learning. In DLGMM, the two tasks are seamlessly coupled and can mutually promote each other. Cluster labels and codebook are jointly estimated to achieve the overall best performance. To incorporate the spatial coherence between neighboring visual patches, we propose a Spatially Coherent DLGMM which uses a Markov Random Field to encourage neighboring patches to share the same visual word label. We use variational inference to approximate the posterior of latent variables and learn model parameters. Experiments on two datasets demonstrate the effectiveness of the two models.
... In fact, during the computation of the word-document matrix, all of the textual elements are randomly mixed to carry out the required statistical processing and analysis. As such, strategies such as Probabilistic Latent Semantic Analysis (PLSA) and LDA are based on the assumption of the exchangeability of words and textual instances [12]. ...
Article
Full-text available
Education quality has become an important issue and has received considerable attention around the world, especially due to its relevant repercussions on the socio-economic development of society. In recent years, many nations have realized the need for a highly skilled workforce to thrive in the emerging knowledge-based economy. They have consequently adopted strategies to identify the lines of action to improve the education quality. In response to the government's efforts to improve the education quality in Colombia, this study examines the current perceptions of the education system from the perspective of key local stakeholders. Therefore, we used a survey that contained open-ended questions to collect information about the limitations and difficulties of the education process for several groups of participants. The collected answers were categorized into a variety of topics using a Latent Dirichlet Allocation based model. Consequently, the students', teachers' and parents' answers were analyzed separately to obtain a general landscape of the perceptions of the education system. Evaluation metrics, such as topic coherence, were quantitatively analyzed to assess the modelling performance. In addition, a methodology for hyper-parameter setting and final topic labelling is presented. The results suggest that topic modelling strategies are a viable alternative to identify strategic lines of action and to obtain a macro-perspective of the perceptions of the education system.
... Few studies compare topic modeling algorithms according to their actual accuracy, and their findings are mixed. The authors of [68] compare the task performance of PLSA and LDA using two datasets. ...
Article
Full-text available
Topic modeling is a popular technique for exploring large document collections. It has proven useful for this task, but its application poses a number of challenges. First, the comparison of available algorithms is anything but simple, as researchers use many different datasets and criteria for their evaluation. A second challenge is the choice of a suitable metric for evaluating the calculated results. The metrics used so far provide a mixed picture, making it difficult to verify the accuracy of topic modeling outputs. Altogether, the choice of an appropriate algorithm and the evaluation of the results remain unresolved issues. Although many studies have reported promising performance by various topic models, prior research has not yet systematically investigated the validity of the outcomes in a comprehensive manner, that is, using more than a small number of the available algorithms and metrics. Consequently, our study has two main objectives. First, we compare all commonly used, non-application-specific topic modeling algorithms and assess their relative performance. The comparison is made against a known clustering and thus enables an unbiased evaluation of results. Our findings show a clear ranking of the algorithms in terms of accuracy. Secondly, we analyze the relationship between existing metrics and the known clustering, and thus objectively determine under what conditions these algorithms may be utilized effectively. This way, we enable readers to gain a deeper understanding of the performance of topic modeling techniques and the interplay of performance and evaluation metrics.
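Evaluating a topic-model clustering against a known reference clustering, as this study does, typically relies on measures such as purity and NMI; a scikit-learn sketch (label arrays are placeholders):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred) -> float:
    """Fraction of documents falling in the majority true class of their cluster."""
    cm = contingency_matrix(labels_true, labels_pred)
    return cm.max(axis=0).sum() / cm.sum()

labels_true = np.array([0, 0, 1, 1, 2, 2])   # known clustering (placeholder)
labels_pred = np.array([0, 0, 1, 2, 2, 2])   # topic-model output (placeholder)
print(purity(labels_true, labels_pred),
      normalized_mutual_info_score(labels_true, labels_pred))
```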
... To acquire ideas for social and policy issues, it effectively extracts interpretable topics from vast collections of data, such as unstructured documents, reports, survey results, discussions, and statistics. Compared to other topic modeling techniques, LDA yields readily interpretable results and mitigates overfitting, which is advantageous in deriving multiple topics from vast amounts of unstructured data [118]. ...
Article
Full-text available
The world is now strengthening its Information and Communication Technology (ICT) capabilities to secure economic growth and national competitiveness. The role of ICT is important for problems like COVID-19: ICT-based innovation is effective in responding to problems in industry, the economy, and society. However, we need to understand, beyond the perspective of performance or investment, that the use and performance of ICT are promoted when each country's ICT-related environment, policies, governance, and regulations are effective. We need to share sustainable ICT experiences, successes, and challenges to solve complex problems and reorganize policies. This study proposes a text mining methodology from a future-oriented perspective to extract semantic system patterns from International Telecommunication Union (ITU) professional reports. In the text extracted from the reports, we found new relationship patterns and potential topics. The research results provide policymakers with insights from diverse perspectives in the search for successful ICT strategies.
... Conventional topic models, such as PLSA and LDA, are among the most popular techniques for discovering topic words within a document (Zan et al., 2007; Lu et al., 2011). Since MKEs are short texts, applying these conventional topic models directly to them usually does not achieve the expected performance. ...
Article
Interest in assessing research impacts is increasing due to its importance for informing actions and funding allocation decisions. The level of innovation (also called "innovation degree" in this article), one of the most essential factors affecting scientific literature's impact, has also received increasing attention. However, current studies mainly focus on the overall innovation degree of scientific literature at the macro level, while ignoring the innovation degree of a specific knowledge element (KE), such as the method knowledge element (MKE). A macro-level view makes it difficult to identify which part of the scientific literature contains the innovations. To bridge this gap, a more fine-grained evaluation of academic papers is urgently needed. A fine-grained evaluation method can ensure the quality of a paper before it is published and identify useful knowledge content in a paper for academic users. Different KEs can be used to perform fine-grained evaluation, but MKEs are usually considered among the most essential of all KEs. Therefore, this study proposes a framework to measure the innovation degree of method knowledge elements (MIDMKE) in scientific literature. In this framework, we first extract the MKEs using a rule-based approach and generate a cloud drop for each MKE using the biterm topic model (BTM). The generated cloud drop is then used to create a method knowledge cloud (MKC) for each MKE. Finally, we calculate the innovation score of an MKE based on the similarity between it and other MKEs of its type. Our empirical study on a China National Knowledge Infrastructure (CNKI) academic literature dataset shows the proposed approach can measure the innovation of MKEs in scientific literature effectively. Our proposed method is useful for both reviewers and funding agencies to assess the quality of academic papers. The dataset, the code implementing the algorithms, and the complete experiment results will be released at: https://github.com/haihua0913/midmke.
... To decide on the number of topics in the news articles, this study set the number of topics to 9 after testing every value between 5 and 15, and conducted topic modeling with iterations = 1000, α = 0.1, and β = 0.01, based on the studies by Zhao et al. (2015) and Lu et al. (2011). The topics of the news articles and café posts are shown in Tables 2 and 3. ...
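Those settings map directly onto Gensim's LdaModel arguments; a sketch with a placeholder corpus (Gensim names the β prior eta):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["policy", "citizen", "service"], ["news", "article", "topic"]] * 30
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=9,      # K chosen after testing values from 5 to 15
               alpha=0.1,         # document-topic prior
               eta=0.01,          # topic-word prior (the beta above)
               iterations=1000,   # inference iterations
               random_state=0)
```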
Article
Full-text available
Purpose: This study determines the possibility of public service innovation to meet the rapid changes in information technology (IT) and the need for new governance by analyzing three cases in South Korea. Methodology: The Smart Governance-Decision Support Systems (SG-DSS) in this study is a new form that guarantees the voluntary participation of citizens by applying IT to governance. SG-DSS supports the demand response that fulfills universal values and decisions about priorities by collecting citizens' needs. It also encourages citizens or stakeholders to participate in establishing implementation plans that are more specific and fit for reality, giving legitimacy to public service policies and developing them into a driving force. Findings: The three case studies on Korean public policies show how public opinions are reflected in public service policies. Therefore, the findings of this study could lay the foundation for customized public services based on intelligent citizen participation by overcoming the current limitations. Unique contribution to theory, practice and policy: The core value of smart governance is to apply IT innovations such as big data and AI to public services. Furthermore, advanced technology enables the collection and application of actual public opinions, thereby improving public services to be more objective and efficient.
... In LDA, each document is represented by a bag of words, where the order of words is ignored. We used the implementation of LDA from Gensim for the Python programming language with standard parameter settings [23,24], and tested it with 50, 100, 150, and 200 topics. Third, we used Paragraph Vector, or Doc2Vec, to represent a trial registration as a vector representation. ...
Article
Full-text available
Background: Clinical trial registries can be used as sources of clinical evidence for systematic review synthesis and updating. Our aim was to evaluate methods for identifying clinical trial registrations that should be screened for inclusion in updates of published systematic reviews.
Methods: A set of 4644 clinical trial registrations (ClinicalTrials.gov) included in 1089 systematic reviews (PubMed) were used to evaluate two methods (document similarity and hierarchical clustering) and three representations (L2-normalised TF-IDF, Latent Dirichlet Allocation, and Doc2Vec) for ranking 163,501 completed clinical trials by relevance. Clinical trial registrations were ranked for each systematic review using seeding clinical trials, simulating how new relevant clinical trials could be automatically identified for an update. Performance was measured by the number of clinical trials that needed to be screened to identify all relevant clinical trials.
Results: Using the document similarity method with TF-IDF feature representation and Euclidean distance metric, all relevant clinical trials for half of the systematic reviews were identified after screening 99 trials (IQR 19 to 491). The best-performing hierarchical clustering used Ward agglomerative clustering (with TF-IDF representation and Euclidean distance) and needed to screen 501 clinical trials (IQR 43 to 4363) to achieve the same result.
Conclusion: An evaluation using a large set of mined links between published systematic reviews and clinical trial registrations showed that document similarity outperformed hierarchical clustering for identifying relevant clinical trials to include in systematic review updates.
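Under L2-normalized TF-IDF, ranking by Euclidean distance coincides with ranking by cosine similarity, which makes the best-performing method above simple to sketch (scikit-learn; the strings are placeholders for registration and seed-trial text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

registrations = ["trial of drug A in adults", "pediatric vaccine study",
                 "drug A long-term follow-up"]   # placeholder corpus
seed = ["trial of drug A"]                       # seed trial from the review

vec = TfidfVectorizer(norm="l2")                 # L2-normalised TF-IDF
X = vec.fit_transform(registrations)
q = vec.transform(seed)

order = euclidean_distances(X, q).ravel().argsort()  # screen in this order
```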
... We used Latent Dirichlet Allocation (LDA), an unsupervised topic modeling technique for identifying latent topics in a collection of documents (corpus). LDA is regarded as among the most effective topic modeling techniques and has been widely used for both confirmatory and exploratory purposes (Lu et al. 2011; Li et al. 2016). We used a bag-of-words (BOW) approach to provide input to LDA for generating latent topics in the document corpus, such that every topic consists of word combinations that capture the context of their usage and each document can then be mapped to multiple derived topics (Blei et al. 2003). ...
Article
Amid the flood of fake news on Coronavirus disease of 2019 (COVID-19), now referred to as the COVID-19 infodemic, it is critical to understand the nature and characteristics of the infodemic, since it not only results in altered individual perception and behavior shifts, such as irrational preventative actions, but also presents an imminent threat to public safety and health. In this study, we build on First Amendment theory, integrate text and network analytics, and deploy a three-pronged approach to develop a deeper understanding of the COVID-19 infodemic. The first prong uses Latent Dirichlet Allocation (LDA) to identify topics and key themes that emerge in COVID-19 fake and real news. The second prong compares and contrasts the emotions in fake and real news. The third prong uses network analytics to understand various network-oriented characteristics embedded in COVID-19 real and fake news, such as PageRank, betweenness centrality, eccentricity, and closeness centrality. This study carries important implications for building next-generation trustworthy technology by providing strong guidance for the design and development of fake news detection and recommendation systems for coping with the COVID-19 infodemic. Additionally, based on our findings, we provide actionable, system-focused guidelines for dealing with immediate and long-term threats from the COVID-19 infodemic.
Article
Global climate change has caused various natural disasters, which have resulted in serious damage to society. Therefore, there has been growing interest in utilizing eco-friendly energy sources such as hydrogen fuel, and hydrogen vehicles and infrastructure have been studied extensively. However, the research trends of hydrogen refueling stations have not been systematically analyzed using text mining on domestic research articles. The keyword network and research topics were analyzed based on Korea Citation Index (KCI) data from the past 10 years. The analysis revealed that "hydrogen refueling station," "fuel cell," and "charging station" are new research keywords. Furthermore, topics such as "hydrogen storage," "hydrogen and electric vehicle," and "safety in hydrogen refueling station" are becoming increasingly popular. These quantitative analysis results provide insight into the development of hydrogen infrastructure and research policy.
... Probabilistic latent semantic indexing-based and LDA-based topic models (e.g., Lu et al., 2011; Luo et al., 2020) are typical examples. Neural network-based pre-trained language models, including BERT and ELMo, are often used to learn universal language representations based on syntactic features of the language (Devlin et al., 2018; Li et al., 2020). ...
Article
We proposed a new rule-based text analysis method to effectively summarize and transform unstructured user-generated content (online customer reviews) into an analysable form for tourism and hospitality research. To differentiate this method, we developed the Disintegrating, Summarizing, Straining, Bagging, Upcycling, and Scoring (DiSSBUS) algorithm, which addresses the following problems in previous approaches: (1) false identification of irrelevant aspect terms, (2) improper handling of multiple aspects and sentiments within a text unit, and (3) data sparsity. The algorithm's distinctive advantage is that it decomposes a single review into a set of bi-terms related to aspects pre-specified from domain knowledge. It can therefore identify customer opinions on specific aspects, which makes it possible to extract variables of interest from online reviews. To evaluate the performance of our confirmatory aspect-level opinion-mining algorithm, we applied it to customer reviews of restaurants in Hawaii. The findings from the empirical test validated its effectiveness.
Article
Full-text available
Probabilistic Latent Semantic Analysis (PLSA) is a fundamental text analysis technique that models each word in a document as a sample from a mixture of topics. PLSA is the precursor of probabilistic topic models including Latent Dirichlet Allocation (LDA). PLSA, LDA and their numerous extensions have been successfully applied to many text mining and retrieval tasks. One important extension of LDA is supervised LDA (sLDA), which distinguishes itself from most topic models in that it is supervised. However, to the best of our knowledge, no prior work extends PLSA in the way sLDA extends LDA, by jointly modeling the contents and the responses of documents. In this paper, we propose supervised PLSA (sPLSA), which can efficiently infer latent topics and their factorized response values from the contents and the responses of documents. The major challenge lies in estimating a document's topic distribution, a constrained probability dictated by both the content and the response of the document. To tackle this challenge, we introduce an auxiliary variable to transform the constrained optimization problem into an unconstrained one. This allows us to derive an efficient Expectation-Maximization (EM) algorithm for parameter estimation. Compared to sLDA, sPLSA converges much faster and requires less hyperparameter tuning, while performing similarly on topic modeling and better in response factorization. This makes sPLSA an appealing choice for latent response analysis such as ranking latent topics by their factorized response values. We apply the proposed sPLSA model to analyze the controversy of bills from the United States Congress. We demonstrate the effectiveness of our model by identifying contentious legislative issues.
Article
Various studies have been conducted to minimize the property damage and casualties caused by fire accidents, which occur more than 40,000 times annually on average. Research papers published over the last 10 years in the Fire Science and Engineering journal (indexed in the Korea Citation Index) and the Fire Technology journal (indexed in the Science Citation Index Expanded) were analyzed using text-mining techniques. Similar research papers published in these two journals were explored, and significant differences in perceptions were identified. Recently, the proportion of studies on “Wildland-Urban-Interface (WUI) fire” and “Battery fire” published in the Fire Technology journal has increased, whereas research papers related to “Firefighter organization” are being actively published in the Fire Science and Engineering journal. A quantitative analysis of studies on fire incidents can provide significant information for developing new policies essential to reducing fire damage.
Chapter
The success of topic modeling algorithms depends on their ability to analyze, index and classify large text corpora. These algorithms can be divided into two groups: the first classifies a textual corpus according to its dominant topics, with LDA, LSA and PLSA the best-known techniques; the second extracts the relationships among the generated topics, as in HLDA, PAM and CTM. However, each algorithm in these groups is dedicated to a single task, and no technique makes it possible to carry out several analyses on a textual corpus at the same time. To cope with this problem, we propose a new technique based on LDA topic modeling that automatically classifies a large text corpus according to its relevant topics, discovers new topics (sub-topics) based on the extracted ones, and organizes the generated topics into a hierarchy in order to analyse the data more deeply. Experiments were conducted to measure the performance of our solution compared to existing techniques. The results obtained are more than satisfactory.
Article
We used Natural Language Processing (NLP) to assess topic diversity in all research articles (∼75,000) from eighteen water science and hydrology journals published between 1991 and 2019. We found that individual water science and hydrology research articles are becoming increasingly diverse in the sense that, on average, the number of topics represented in individual articles is increasing, which may be a sign of increasing interdisciplinarity. This is true even though the body of water science and hydrology literature as a whole is not becoming more topically diverse. Topics with the largest increases in popularity were Climate Change Impacts, Water Policy & Planning, and Pollutant Removal. Topics with the largest decreases in popularity were Stochastic Models and Numerical Models. At a journal level, Water Resources Research, Journal of Hydrology, and Hydrological Processes are the three most topically diverse journals among the corpus that we studied.
Chapter
Events that people are exposed to in the course of their lives have important effects on their quality of life, and events with significant effects on large parts of society are shared with the public through news texts. Keeping pace with the digital age, we address the problem of automatically detecting and tracking events in the news using natural language processing methods. An event-based news clustering approach is presented for organizing the data, which is necessary to extract meaningful information from the large volumes of news accumulating in online environments. The approach uses named entities to increase clustering performance and speed. Additionally, an event-based text clustering dataset was created by the researchers and contributed to the literature. Using the B-cubed evaluation metric on this test dataset, which consists of 930 different event groups and a total of 19,848 news articles, the event-based text clustering problem was solved with an F-score of over 85%.
Article
Topic models assert that documents are distributions over latent topics and latent topics are distributions over words. A nested document collection has documents nested inside a higher order structure such as articles nested in journals, podcasts within authors, or web pages nested in web sites. In a single collection of documents, topics are global or shared across all documents. For web pages nested in web sites, topic frequencies likely vary across web sites and within a web site, topic frequencies almost certainly vary from web page to web page. A hierarchical prior for topic frequencies models this hierarchical structure with a global topic distribution, web site topic distributions varying around the global topic distribution, and web page topic distributions varying around the web site topic distribution. Web pages in one United States local health department web site often contain local geographic and news topics not found on web pages of other local health department web sites. For web pages nested in web sites, some topics are likely local topics and unique to an individual web site. Regular topic models ignore the nesting structure and may identify local topics but cannot label those topics as local nor identify the corresponding web site owner. Explicitly modeling local topics identifies the owning web site and identifies the topic as local. In US health web site data, topic coverage is defined at the web site level after removing local topic words from pages. Hierarchical local topic models can be used to study how well health topics are covered.
Article
Full-text available
The aggregation of the same type of socio-economic activities in urban space generates urban functional zones, each with one main function (e.g., residential, educational or commercial); these zones are important parts of the city. With the development of deep learning in remote sensing, the accuracy of land-use decoding has greatly improved. However, even fine-resolution remote sensing imagery cannot directly capture economic and social information, and its long revisit cycle (low temporal resolution) is poorly suited to urban flooding, which often lasts only a few hours. Cities contain a large amount of "social sensing" data that records human socio-economic activities, and GIS as a discipline has strong socio-economic ties. We propose a new GeoSemantic2vec algorithm for urban function recognition based on recent advances in natural language processing (the BERT model), which utilizes the rich semantic information in urban POI data to characterize urban functions. Taking the Wuhan flooding event in summer 2020 as an example, we identified 84.55% of the flooding locations reported in social media. We also used the proposed algorithm to divide the main urban area of Wuhan into 8 types of urban functional zones (kappa coefficient 0.615) and construct a "City Portrait" of flooding locations. This paper summarizes existing research on urban function identification using natural language processing techniques and proposes an improved algorithm, which is of value for urban flood location detection and risk assessment.
Article
Topic modeling can be synergistically interrelated with document clustering. We present an innovative unsupervised approach to the interrelationship of topic modeling with document clustering. The devised approach exploits Bayesian generative modeling and posterior inference to seamlessly unify and jointly carry out the two tasks. Specifically, a Bayesian nonparametric model of text collections formulates a novel interrelationship of word-embedding topics with a Dirichlet process mixture of cluster components. The latter enables countably infinite clusters and permits the automatic inference of their actual number in a statistically principled manner. All latent clusters and topics under the foresaid model are inferred through collapsed Gibbs sampling and parameter estimation. An extensive empirical study of the presented approach is conducted on benchmark real-world corpora of text documents. The experimental results demonstrate its higher effectiveness in partitioning text collections and coherently discovering their semantics, compared to state-of-the-art competitors and tailored baselines. Computational efficiency is also examined under different conditions to provide an insightful analysis of scalability.
Conference Paper
Full-text available
Contextual text mining is concerned with extracting topical themes from a text collection with context information (e.g., time and location) and comparing/analyzing the variations of themes over different contexts. Since the topics covered in a document are usually related to the context of the document, analyzing topical themes within context can potentially reveal many interesting theme patterns. In this paper, we propose a new general probabilistic model for contextual text mining that can cover several existing models as special cases. Specifically, we extend the probabilistic latent semantic analysis (PLSA) model by introducing context variables to model the context of a document. The proposed mixture model, called contextual probabilistic latent semantic analysis (CPLSA) model, can be applied to many interesting mining tasks, such as temporal text mining, spatiotemporal text mining, author-topic analysis, and cross-collection comparative analysis. Empirical experiments show that the proposed mixture model can discover themes and their contextual variations effectively.
Conference Paper
Full-text available
In this paper, we define and study a novel text mining problem, which we refer to as Comparative Text Mining (CTM). Given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as summarize the similarity and differences of these collections along each common theme. This general problem subsumes many interesting applications, including business intelligence and opinion summarization. We propose a generative probabilistic mixture model for comparative text mining. The model simultaneously performs cross-collection clustering and within-collection clustering, and can be applied to an arbitrary set of comparable text collections. The model can be estimated efficiently using the Expectation-Maximization (EM) algorithm. We evaluate the model on two different text data sets (i.e., a news article data set and a laptop review data set), and compare it with a baseline clustering method also based on a mixture model. Experiment results show that the model is quite effective in discovering the latent common themes across collections and performs significantly better than our baseline mixture model.
Conference Paper
Full-text available
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
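To make the generative story concrete, here is a toy numpy simulation of the process just described (draw a topic mixture, then a topic per word slot, then a word per topic); the vocabulary, topic count, and hyperparameters are arbitrary illustrative values, not from the paper.

    # Toy simulation of the LDA generative process (illustrative values only).
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["gene", "cell", "ball", "team", "model", "data"]
    K, V, doc_len = 2, len(vocab), 8
    alpha, beta = 0.5, 0.1

    phi = rng.dirichlet([beta] * V, size=K)   # each topic: a distribution over words
    theta = rng.dirichlet([alpha] * K)        # one document's mixture over topics
    z = rng.choice(K, size=doc_len, p=theta)  # a topic assignment for each word slot
    words = [vocab[rng.choice(V, p=phi[k])] for k in z]
    print(words)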
Conference Paper
Full-text available
Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
Conference Paper
Full-text available
Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for measuring semantic meaning in inferred topics. We back these measures with large-scale user studies, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood. Surprisingly, topic models which perform better on held-out likelihood may infer less semantically meaningful topics.
Conference Paper
Full-text available
Topic modeling has been a key problem for document analysis. One of the canonical approaches for topic modeling is Probabilistic Latent Semantic Indexing, which maximizes the joint probability of documents and terms in the corpus. The major disadvantage of PLSI is that it estimates the probability distribution of each document on the hidden topics independently and the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting. Latent Dirichlet Allocation (LDA) is proposed to overcome this problem by treating the probability distribution of each document over topics as a hidden random variable. Both of these two methods discover the hidden topics in the Euclidean space. However, there is no convincing evidence that the document space is Euclidean, or flat. Therefore, it is more natural and reasonable to assume that the document space is a manifold, either linear or nonlinear. In this paper, we consider the problem of topic modeling on intrinsic document manifold. Specifically, we propose a novel algorithm called Laplacian Probabilistic Latent Semantic Indexing (LapPLSI) for topic modeling. LapPLSI models the document space as a submanifold embedded in the ambient space and directly performs the topic modeling on this document manifold in question. We compare the proposed LapPLSI approach with PLSI and LDA on three text data sets. Experimental results show that LapPLSI provides better representation in the sense of semantic structure.
Conference Paper
Full-text available
Non-negative Matrix Factorization (NMF, [5]) and Probabilistic Latent Semantic Analysis (PLSA, [4]) have been successfully applied to a number of text analysis tasks such as document clustering. Despite their different inspirations, both methods are instances of multinomial PCA [1]. We further explore this relationship and first show that PLSA solves the problem of NMF with KL divergence, and then explore the implications of this relationship.
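The correspondence can be stated compactly: PLSA's factorization of the joint document-word distribution matches NMF under the generalised KL divergence, so a suitably normalised NMF-KL solution yields PLSA parameters and vice versa. In standard notation (ours, not quoted from the paper):

\[
P(d, w) = \sum_{k} P(k)\, P(d \mid k)\, P(w \mid k)
\quad\longleftrightarrow\quad
X \approx WH, \qquad
\min_{W, H \ge 0}\; \sum_{d, w} \Big( X_{dw} \log \frac{X_{dw}}{(WH)_{dw}} - X_{dw} + (WH)_{dw} \Big),
\]

where X is the (count-normalised) term-document matrix.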
Conference Paper
Full-text available
Latent Dirichlet Allocation (LDA) is a fully generative approach to language modelling which overcomes the inconsistent generative semantics of Probabilistic Latent Semantic Indexing (PLSI). This paper shows that PLSI is a maximum a posteriori estimated LDA model under a uniform Dirichlet prior, therefore the perceived shortcomings of PLSI can be resolved and elucidated within the LDA framework.
Conference Paper
Full-text available
In this paper, we formally define the problem of topic modeling with network structure (TMN). We propose a novel solution to this problem, which regularizes a statistical topic model with a harmonic regularizer based on a graph structure in the data. The proposed method combines topic modeling and social network analysis, and leverages the power of both statistical topic models and discrete regularization. The output of this model can summarize well topics in text, map a topic onto the network, and discover topical communities. With appropriate instantiations of the topic model and the graph-based regularizer, our model can be applied to a wide range of text mining problems such as author-topic analysis, community discovery, and spatial text mining. Empirical experiments on two data sets with different genres show that our approach is effective and outperforms both text-oriented methods and network-oriented methods alone. The proposed model is general; it can be applied to any text collections with a mixture of topics and an associated network structure.
Article
Full-text available
A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.
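The Markov chain Monte Carlo algorithm referred to here is a collapsed Gibbs sampler that resamples each word's topic assignment from its conditional distribution; the standard form of that update (our notation, not quoted from the paper) is

\[
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
\cdot
\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha},
\]

where the counts n exclude the current token, n^{(w_i)}_{-i,j} is how often word w_i is assigned to topic j, n^{(d_i)}_{-i,j} is how often topic j appears in document d_i, W is the vocabulary size, and T is the number of topics.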
Article
Full-text available
Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
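For reference, the latent class model and its EM fit can be written compactly (standard notation, ours): the model decomposes the joint document-word probability, and EM alternates between computing topic posteriors and re-estimating the factors.

\[
P(d, w) = P(d) \sum_{z} P(z \mid d)\, P(w \mid z),
\qquad
\text{E-step: } P(z \mid d, w) = \frac{P(z \mid d)\, P(w \mid z)}{\sum_{z'} P(z' \mid d)\, P(w \mid z')},
\]
\[
\text{M-step: } P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w),
\qquad
P(z \mid d) \propto \sum_{w} n(d, w)\, P(z \mid d, w),
\]

where n(d, w) is the number of occurrences of word w in document d.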
Article
Full-text available
We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be the result of a mixture of each author's topic mixture. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors and detection of unusual papers by specific authors. An online query interface to the model is also discussed that allows interactive exploration of author-topic models for corpora such as CiteSeer.
Article
Full-text available
We address the problem of learning topic hierarchies from data. The model selection problem in this domain is daunting: which of the large collection of possible trees should be used? We take a Bayesian approach, generating an appropriate prior via a distribution on partitions that we refer to as the nested Chinese restaurant process. This nonparametric prior allows arbitrarily large branching factors and readily accommodates growing data collections. We build a hierarchical topic model by combining this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation. We illustrate our approach on simulated data and with an application to the modeling of NIPS abstracts.
Conference Paper
This paper presents an LDA-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. Unlike other recent work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. Thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. We present results on nine months of personal email, 17 years of NIPS research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends.
Article
We make an analogy between images and statistical mechanics systems. Pixel gray levels and the presence and orientation of edges are viewed as states of atoms or molecules in a lattice-like physical system. The assignment of an energy function in the physical system determines its Gibbs distribution. Because of the Gibbs distribution, Markov random field (MRF) equivalence, this assignment also determines an MRF image model. The energy function is a more convenient and natural mechanism for embodying picture attributes than are the local characteristics of the MRF. For a range of degradation mechanisms, including blurring, nonlinear deformations, and multiplicative or additive noise, the posterior distribution is an MRF with a structure akin to the image model. By the analogy, the posterior distribution defines another (imaginary) physical system. Gradual temperature reduction in the physical system isolates low energy states ("annealing"), or what is the same thing, the most probable states under the Gibbs distribution. The analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations. The result is a highly parallel "relaxation" algorithm for MAP estimation. We establish convergence properties of the algorithm and we experiment with some simple pictures, for which good restorations are obtained at low signal-to-noise ratios.
Article
The generative aspect model is an extension of the multinomial model for text that allows word probabilities to vary stochastically across documents. Previous results with aspect models have been promising, but hindered by the computational difficulty of carrying out inference and learning. This paper demonstrates that the simple variational methods of Blei et al. (2001) can lead to inaccurate inferences and biased learning for the generative aspect model. We develop an alternative approach that leads to higher accuracy at comparable cost. An extension of Expectation-Propagation is used for inference and then embedded in an EM algorithm for learning. Experimental results are presented for both synthetic and real data sets.
Article
Summary: A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.
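In its general form, the algorithm alternates an expectation step that averages the complete-data log-likelihood over the latent variables and a maximization step that re-estimates the parameters, which yields the monotone likelihood behaviour referred to above (standard notation, ours):

\[
Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \sim P(\cdot \mid X,\, \theta^{(t)})}\!\left[ \log L(\theta; X, Z) \right],
\qquad
\theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)}),
\]

which guarantees \(\log L(\theta^{(t+1)}; X) \ge \log L(\theta^{(t)}; X)\).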
Conference Paper
In this work, we address the problem of joint modeling of text and citations in the topic modeling framework. We present two different models called the Pairwise-Link-LDA and the Link-PLSA-LDA models. The Pairwise-Link-LDA model combines the ideas of LDA [4] and Mixed Membership Block Stochastic Models [1] and allows modeling arbitrary link structure. However, the model is computationally expensive, since it involves modeling the presence or absence of a citation (link) between every pair of documents. The second model solves this problem by assuming that the link structure is a bipartite graph. As the name indicates, Link-PLSA-LDA model combines the LDA and PLSA models into a single graphical model. Our experiments on a subset of Citeseer data show that both these models are able to predict unseen data better than the baseline model of Erosheva and Lafferty [8], by capturing the notion of topical similarity between the contents of the cited and citing documents. Our experiments on two different data sets on the link prediction task show that the Link-PLSA-LDA model performs the best on the citation prediction task, while also remaining highly scalable. In addition, we also present some interesting visualizations generated by each of the models.
Conference Paper
The Indian buffet process (IBP) is an exchangeable distribution over binary matrices used in Bayesian nonparametric featural models. In this paper we propose a three-parameter generalization of the IBP exhibiting power-law behavior. We achieve this by generalizing the beta process (the de Finetti measure of the IBP) to the stable-beta process and deriving the IBP corresponding to it. We find interesting relationships between the stable-beta process and the Pitman-Yor process (another stochastic process used in Bayesian nonparametric models with interesting power-law properties). We derive a stick-breaking construction for the stable-beta process, and find that our power-law IBP is a good model for word occurrences in document corpora.
Conference Paper
Topic models, such as latent Dirichlet allocation (LDA), have been an effective tool for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about sports is more likely to also be about health than international finance. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution (1). We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets.
Conference Paper
We explore the utility of different types of topic models for retrieval purposes. Based on prior work, we describe several ways that topic models can be integrated into the retrieval process. We evaluate the effectiveness of different types of topic models within those retrieval approaches. We show that: (1) topic models are effective for document smoothing; (2) more rigorous topic models such as Latent Dirichlet Allocation provide gains over cluster-based models; (3) more elaborate topic models that capture topic dependencies provide no additional gains; (4) smoothing documents by using their similar documents is as effective as smoothing them by using topic models; (5) doing query expansion should utilize topics discovered in the top feedback documents instead of coarse-grained topics from the whole corpus; (6) generally, incorporating topics in the feedback documents for building relevance models can benefit the performance more for queries that have more relevant documents.
Conference Paper
Latent Dirichlet allocation (LDA) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. However, LDA does not capture correlations between topics. In this paper, we introduce the pachinko allocation model (PAM), which captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). The leaves of the DAG represent individual words in the vocabulary, while each interior node represents a correlation among its children, which may be words or other interior nodes (topics). PAM provides a …
Conference Paper
A natural evaluation metric for statistical topic models is the probability of held-out documents given a trained model. While exact computation of this probability is intractable, several estimators for this probability have been used in the topic modeling literature, including the harmonic mean method and empirical likelihood method. In this paper, we demonstrate experimentally that commonly-used methods are unlikely to accurately estimate the probability of held-out documents, and propose two alternative methods that are both accurate and efficient. In this paper we consider only the simplest topic model, latent Dirichlet allocation (LDA), and compare a number of methods for estimating the probability of held-out documents given a trained model. Most of the methods presented, however, are applicable to more complicated topic models. In addition to comparing evaluation methods that are currently used in the topic modeling literature, we propose several alternative methods. We present empirical results on synthetic and real-world data sets showing that the currently-used estimators are less accurate and have higher variance than the proposed new estimators.
Conference Paper
Search algorithms incorporating some form of topic model have a long history in information retrieval. For example, cluster-based retrieval has been studied since the 60s and has recently produced good results in the language model framework. An approach to building topic models based on a formal generative model of documents, Latent Dirichlet Allocation (LDA), is heavily cited in the machine learning literature, but its feasibility and effectiveness in information retrieval is mostly unknown. In this paper, we study how to efficiently use LDA to improve ad-hoc retrieval. We propose an LDA-based document model within the language modeling framework, and evaluate it on several TREC collections. Gibbs sampling is employed to conduct approximate inference in LDA and the computational complexity is analyzed. We show that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
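A common form of the LDA-based document model in this retrieval setting interpolates a Dirichlet-smoothed document language model with the word probability under the fitted LDA topics; the paper's exact parameterization may differ, but the general shape (our notation) is

\[
P(w \mid d) = \lambda\, P_{\mu}(w \mid d) + (1 - \lambda)\, P_{\mathrm{LDA}}(w \mid d),
\qquad
P_{\mathrm{LDA}}(w \mid d) = \sum_{k=1}^{K} P(w \mid z = k)\, P(z = k \mid d),
\]

where the topic distributions are estimated by Gibbs sampling and λ controls the weight given to the topic model.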
Conference Paper
In this paper, we propose a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus. In the latent semantic space derived by the non-negative matrix factorization (NMF), each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. The cluster membership of each document can be easily determined by finding the base topic (the axis) with which the document has the largest projection value. Our experimental evaluations show that the proposed document clustering method surpasses the latent semantic indexing and the spectral clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracies.
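A minimal sketch of this clustering-by-dominant-base-topic idea, assuming scikit-learn; the corpus and parameters are illustrative, and factorizing the document-by-term matrix here is equivalent, for cluster assignment, to factorizing its term-document transpose as in the paper.

    # NMF-based document clustering: the cluster of each document is the base
    # topic (axis) with the largest projection value (illustrative sketch).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = ["stock market trading", "market prices fall",
            "rain storm weather", "weather forecast rain"]
    X = TfidfVectorizer().fit_transform(docs)             # documents as rows

    nmf = NMF(n_components=2, init="nndsvd", random_state=0)
    W = nmf.fit_transform(X)                              # document coordinates over base topics
    labels = np.argmax(W, axis=1)                         # dominant base topic = cluster label
    print(labels)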
Conference Paper
In this paper, we propose a novel non-negative matrix factorization (NMF) of the affinity matrix for document clustering, which enforces non-negativity and orthogonality constraints simultaneously. With the orthogonality constraints, this NMF provides a solution to spectral clustering, inheriting the advantages of spectral clustering and offering a more reasonable clustering interpretation than previous NMF-based clustering methods. Furthermore, with the non-negativity constraints, the proposed method is also superior to traditional eigenvector-based spectral clustering, as it inherits the benefit of NMF-based methods that the non-negative solution is intuitive and the final clusters can be derived from it directly. As a result, the proposed method combines the advantages of spectral clustering and NMF-based methods, and hence outperforms both, as demonstrated by experimental results on the TDT2 and Reuters-21578 corpora.
Conference Paper
In this paper, we define the problem of topic-sentiment analysis on Weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. The proposed Topic-Sentiment Mixture (TSM) model can reveal the latent topical facets in a Weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. It could also provide general sentiment models that are applicable to any ad hoc topics. With a specifically designed HMM structure, the sentiment models and topic models estimated with TSM can be utilized to extract topic life cycles and sentiment dynamics. Empirical experiments on different Weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from Weblog collections. The TSM model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction.
Article
The language modeling approach to retrieval has been shown to perform well empirically. One advantage of this new approach is its statistical foundations. However, feedback, as one important component in a retrieval system, has only been dealt with heuristically in this new retrieval approach: the original query is usually literally expanded by adding additional terms to it. Such expansion-based feedback creates an inconsistent interpretation of the original and the expanded query. In this paper, we present a more principled approach to feedback in the language modeling approach. Specifically, we treat feedback as updating the query language model based on the extra evidence carried by the feedback documents. Such a model-based feedback strategy easily fits into an extension of the language modeling approach. We propose and evaluate two different approaches to updating a query language model based on feedback documents, one based on a generative probabilistic model of feedback documents and one based on minimization of the KL-divergence over feedback documents. Experiment results show that both approaches are effective and outperform the Rocchio feedback approach.
Article
Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing , which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. In this paper, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections.
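Among the popular smoothing methods compared in this line of work, Dirichlet-prior smoothing is a representative example; in standard notation (ours), the smoothed document model is

\[
p_{\mu}(w \mid d) = \frac{c(w; d) + \mu\, p(w \mid \mathcal{C})}{|d| + \mu},
\]

where c(w; d) is the count of w in d, |d| is the document length, p(w | C) is the collection language model, and the parameter μ > 0 controls the amount of smoothing, with retrieval performance known to be sensitive to its setting.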
Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML '06 (pp. 577–584). ACM.
Mei, Q., Cai, D., Zhang, D., & Zhai, C. (2008). Topic modeling with network regularization. In WWW '08: Proceedings of the 17th International Conference on World Wide Web (pp. 101–110). New York, NY, USA: ACM.
Cai, D., Mei, Q., Han, J., & Zhai, C. (2008). Modeling hidden topics on document manifold. In J. G. Shanahan, S. Amer-Yahia, I. Manolescu, Y. Zhang, D. A. Evans, A. Kolcz, K.-S. Choi, & A. Chowdhury (Eds.), CIKM (pp. 911–920). ACM.
Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems. MIT Press.
Blei, D. M., Griffiths, T. L., Jordan, M. I., & Tenenbaum, J. B. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems. MIT Press.
Blei, D. M., & Lafferty, J. D. (2005). Correlated topic models. In NIPS. MIT Press.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.