Article

Investigating task performance of probabilistic topic models: An empirical study of PLSA and LDA


Abstract

Probabilistic topic models have recently attracted much attention because of their successful applications in many text mining tasks such as retrieval, summarization, categorization, and clustering. Although many existing studies have reported promising performance of these topic models, none of the work has systematically investigated the task performance of topic models; as a result, some critical questions that may affect the performance of all applications of topic models are mostly unanswered, particularly how to choose between competing models, how multiple local maxima affect task performance, and how to set parameters in topic models. In this paper, we address these questions by conducting a systematic investigation of two representative probabilistic topic models, probabilistic latent semantic analysis (PLSA) and Latent Dirichlet Allocation (LDA), using three representative text mining tasks, including document clustering, text categorization, and ad-hoc retrieval. The analysis of our experimental results provides deeper understanding of topic models and many useful insights about how to optimize the performance of topic models for these typical tasks. The task-based evaluation framework is generalizable to other topic models in the family of either PLSA or LDA.


... Applications of topic modelling include document representation and collection summarization; it has not been widely used for collection partitioning, but note that we do not explore that problem here. Topic models have however been used to improve the performance of clustering in a variety of ways [9,22,33]. ...
... Combined methods. Lu et al. [22] explored an approach to integration of topic modelling with clustering. They compared the performance of two topic modelling methods, pLSA and LDA, in the context of document clustering, considering two ways in which topic modelling and clustering can interact. ...
... Contrasting these approaches, in the work of Lu et al. [22] clusters are identified by most significant topic and therefore the number of clusters is naturally equivalent to the number of topics. In the work of Xie and Xing [33], clusters are considered as mixtures of multiple local topics and global topics are mixtures of clusters. ...
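To make the two integration styles concrete, the sketch below (Python with NumPy and scikit-learn; the document-topic matrix theta is a synthetic stand-in for the output of a fitted PLSA/LDA model) contrasts cluster-by-most-significant-topic with ordinary clustering in the topic space:

```python
import numpy as np
from sklearn.cluster import KMeans

# theta: (n_docs, n_topics) per-document topic proportions, here drawn
# synthetically; in practice it comes from a fitted PLSA/LDA model.
rng = np.random.default_rng(0)
theta = rng.dirichlet(alpha=np.full(10, 0.1), size=500)

# Style 1 (Lu et al.): the most significant topic is the cluster label,
# so the number of clusters equals the number of topics.
clusters_by_topic = theta.argmax(axis=1)

# Style 2: treat topic proportions as a low-dimensional representation
# and run a conventional clustering algorithm on top of it.
clusters_by_kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(theta)
```

In the first style the number of clusters is tied to the number of topics, exactly as noted above; the second style decouples the two.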
... Topic coherence and perplexity scores are the main techniques used to estimate the number of topics. The α and β parameters were tested with a value recommended in the literature for short texts (50/K) (Lu et al., 2011; Steyvers & Griffiths, 2007). Another critical parameter of the LDA model, K, which reflects the number of topics, was also used in the model fitting. ...
... Another critical parameter of the LDA model, K, which reflects the number of topics, was also used in the model fitting. When the K parameter is increased, finer-grained topics are created, but when it is decreased, coarser-grained topics are created (Jelodar et al., 2019; Lu et al., 2011). To decide on the suitable or optimal number of topics (K), the LDA model was built and explored with varied K values (5, 6, 7, and 60) for the defined α and β values. ...
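A hedged sketch of such a sweep using Gensim (the tokenized corpus is a toy placeholder; note that Gensim exposes LDA's β prior under the name eta): fit LDA for several values of K with α = 50/K and compare coherence and perplexity.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy tokenized corpus; replace with real short texts.
texts = [["game", "learning", "student"], ["serious", "game", "design"]] * 50
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in (5, 6, 7, 10):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   alpha=50.0 / k,   # symmetric document-topic prior, 50/K
                   eta=0.01,         # topic-word prior (the beta above)
                   passes=10, random_state=0)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    print(k, coherence, lda.log_perplexity(corpus))
```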
Article
The literature of games in education has a rich and multidisciplinary content. Due to the large number of studies in the field, it is not easy to analyze all relevant studies, and few studies explore the big picture of research trends in the field. For this reason, the purpose of this study is to examine longitudinal trends of game-based research in education using text mining techniques. A dataset of 4980 publications indexed in the SCOPUS database between 1967 and mid-2021 was retrieved for analysis. The results include descriptive statistics of game-based research, trends of the research topics, and trends in the frequency of each topic over time. They show that the number of studies focusing on the use of games in education has increased, particularly since the 2000s, when internet use accelerated and became widespread. Approximately 70% of all the studies were conducted in the last 10 years. One third of the studies are related to the main topic of game-based learning. It is significant that in the last three decades the topic of serious games has been among the top three trends. In terms of topic usage acceleration, the highest values belong to game-based learning, serious games, and student science games, in that order. The findings of this study are expected to guide the field by providing a better understanding of the trends of games in education and offer a direction for future research.
... A wide range of approaches has been proposed for document clustering, including hierarchical methods [33], partitional methods [34], spectral methods [35], matrix factorization [36,37] and topic models [38]. ...
... Two seminal topic models [22,16] are exploited for document clustering in [38], where topics play the role of clusters and each text document is placed inside the most relevant cluster. ...
Article
Topic modeling can be unified synergically with document clustering. In this manuscript, we propose two innovative unsupervised approaches for the combined modeling and interrelated accomplishment of the two tasks. Both approaches rely on respective Bayesian generative models of topics, contents and clusters in textual corpora. Such models treat topics and clusters as linked latent factors in document wording. In particular, under the generative model of the second approach, textual documents are characterized by topic distributions, that are allowed to vary around the topic distributions of their membership clusters. Within the devised models, algorithms are designed to implement Rao-Blackwellized Gibbs sampling together with parameter estimation. These are derived mathematically for carrying out topic modeling with document clustering in a simultaneous and interrelated manner. A comparative empirical evaluation demonstrates the effectiveness of the presented approaches, over different families of state-of-the-art competitors, in clustering real-world benchmark text collections and, also, uncovering their underlying semantics. Besides, a case study is developed as an insightful qualitative analysis of results on real-world text corpora.
... For topic modeling studies based on textual content analysis, the creation of an empirical corpus is usually one of the most critical steps [29], [30]. The selection of the approach to be applied in the corpus creation procedure usually has a direct influence on the results [28], [29]. ...
... Moreover, the LDA algorithm provides many efficient methods for calculating the coherence score to estimate the optimal number of topics and is widely applied in many fields [34], [43]. For these reasons, LDA is a highly preferred and accepted algorithm for the semantic content analysis of large textual corpora because of the systematic approaches it provides [28], [30], [39]. ...
Article
Full-text available
Bioinformatics, which has developed rapidly in recent years with the collaborative contributions of the fields of biology and informatics, provides a deeper perspective on the analysis and understanding of complex biological data. In this regard, bioinformatics has an interdisciplinary background and a rich literature in terms of domain-specific studies. Providing a holistic picture of bioinformatics research by analyzing the major topics and their trends and developmental stages is critical for an understanding of the field. From this perspective, this study aimed to analyze the last 50 years of bioinformatics studies (a total of 71,490 articles) by using an automated text-mining methodology based on probabilistic topic modeling to reveal the main topics, trends, and the evolution of the field. As a result, 24 major topics that reflect the focuses and trends of the field were identified. Based on the discovered topics and their temporal tendencies from 1970 until 2020, the developmental periods of the field were divided into seven phases, from the “newborn” to the “wisdom” stages. Moreover, the findings indicated a recent increase in the popularity of the topics “Statistical Estimation”, “Data Analysis Tools”, “Genomic Data”, “Gene Expression”, and “Prediction”. The results of the study revealed that, in bioinformatics studies, interest in innovative computing and data analysis methods based on artificial intelligence and machine learning has gradually increased, thereby marking a significant improvement in contemporary analysis tools and techniques based on prediction.
... Topic models have been successfully applied in Natural Language Processing with various applications such as information extraction, text clustering, summarization, and sentiment analysis [1][2][3][4][5][6]. The most popular conventional topic model, Latent Dirichlet Allocation [7], learns document-topic and topic-word distribution via Gibbs sampling and mean field approximation. ...
... Our goal is to learn a mapping function f_θ : ℝ^V → ℝ^T of the encoder θ which transforms x to the latent distribution z (x⁻ and x⁺ are transformed to z⁻ and z⁺, respectively). A reasonable mapping function must fulfill two qualities: (1) x and x⁺ are mapped onto nearby positions; (2) x and x⁻ are projected distantly. Regarding goal (1) as the main objective and goal (2) as the constraint enforcing the model to learn the relations among dissimilar samples, we specify a constrained optimization problem in which a coefficient denotes the strength of the constraint. ...
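The excerpt above lost the symbol for the constraint strength; a common reading is a Lagrangian-style relaxation of goal (2). The NumPy sketch below expresses such a triplet objective (the function name, the hinge form, and the weight lam are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def triplet_topic_loss(z, z_pos, z_neg, lam=1.0, margin=1.0):
    """Main objective: pull z toward z_pos; constraint term with weight
    `lam`: push z at least `margin` away from z_neg (hinge relaxation)."""
    pull = np.sum((z - z_pos) ** 2)
    push = np.maximum(0.0, margin - np.sum((z - z_neg) ** 2))
    return pull + lam * push
```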
Preprint
Full-text available
Recent empirical studies show that adversarial topic models (ATM) can successfully capture semantic patterns of a document by differentiating it from another, dissimilar sample. However, that discriminative-generative architecture has two important drawbacks: (1) the architecture does not relate similar documents, which share the same document-word distribution of salient words; (2) it restricts the ability to integrate external information, such as the sentiment of the document, which has been shown to benefit the training of neural topic models. To address those issues, we revisit the adversarial topic architecture from the viewpoint of mathematical analysis, propose a novel approach to re-formulate the discriminative goal as an optimization problem, and design a novel sampling method which facilitates the integration of external variables. The reformulation encourages the model to incorporate the relations among similar samples and enforces the constraint on the similarity among dissimilar ones, while the sampling method, which is based on the internal input and reconstructed output, helps inform the model of salient words contributing to the main topic. Experimental results show that our framework outperforms other state-of-the-art neural topic models in topic coherence on three common benchmark datasets spanning various domains, vocabulary sizes, and document lengths.
... Moreover, it is possible to employ the fitted model to analyze a new document/text [4]. Also, LDA is useful for large-scale corpora, making it suitable for exploratory literature reviews over large bodies of research articles [8][11]. ...
... The paper used LDA and its Mallet wrapper to build the models. The inherent advantage of LDA lies in its capability to handle large-scale corpora efficiently [11]. Since management research is interdisciplinary in nature, a full-fledged, well-defined literature review requires analysis of many research contributions. ...
... To understand the topics inside the bullet chats, we perform topic modeling for each platform. Following a previous study (Lu, Mei, and Zhai 2011), we randomly sample 5% of the bullet chats and leverage BERTopic (Grootendorst 2022) to extract topics from them with the default settings. The model produces 5,119 and 2,872 topics for Bilibili and Huya, respectively. ...
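A sketch of that sampling-plus-BERTopic pipeline (the chat list is a tiny synthetic placeholder; real runs need a sizable, varied corpus for the underlying embedding and clustering steps to behave):

```python
import random
from bertopic import BERTopic

# Assumed input: the full list of bullet-chat strings for one platform.
bullet_chats = ["nice play", "that team is unstoppable", "gg", "so toxic"] * 2000

random.seed(0)
sample = random.sample(bullet_chats, k=len(bullet_chats) // 20)  # 5% random sample

topic_model = BERTopic()                    # default settings, as in the study
topics, probs = topic_model.fit_transform(sample)
print(topic_model.get_topic_info())         # one row per extracted topic
```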
Article
Esports, short for electronic sports, is a form of competition using video games and has attracted an audience of more than 530 million worldwide. To watch esports, people use online livestreaming platforms. Recently, a novel interaction method, namely "bullet chats," has been introduced on these platforms. Different from conventional comments, bullet chats are scrolling comments posted by audiences and synchronized to the livestreaming timeline, enabling audiences to share and communicate their immediate perspectives. The real-time nature of bullet chats therefore brings a new perspective to esports analysis. In this paper, we conduct the first empirical study on bullet chats for esports, focusing on one of the most popular video games, League of Legends (LoL). Specifically, we collect 21 million bullet chats of LoL from Jan. 2023 to Mar. 2023 across two mainstream platforms (Bilibili and Huya). By performing quantitative analysis, we reveal how the quantity and toxicity of bullet chats are distributed (and change) w.r.t. three aspects: the season, the team, and the match. Our findings show that teams with higher rankings tend to attract a greater quantity of bullet chats, and these chats are often characterized by a higher degree of toxicity. We then utilize topic modeling to identify topics among bullet chats. Interestingly, we find that a considerable portion of topics (14.14% on Bilibili and 22.94% on Huya) discuss themes beyond the game, including genders, entertainment stars, non-esports athletes, and so on. Besides, by further modeling topics on toxic bullet chats, we find hateful speech targeting different social groups, ranging from professions to regions. To the best of our knowledge, this work is the first measurement of bullet chats on esports livestreaming. We believe our study can shed light on esports research from the perspective of bullet chats.
... Text clustering is an important problem in the field of natural language processing [36]. Clustering short texts is challenging due to the extremely low amount of word co-occurrence within such texts [20]; this poses difficulties when employing traditional topic modeling algorithms, such as probabilistic latent semantic analysis (PLSA) and Latent Dirichlet Allocation (LDA) [37], as these algorithms were primarily designed for long texts [5]. ...
Article
Full-text available
Topic modeling methods have proved effective for inferring latent topics from short texts. Dealing with short texts is challenging yet helpful for many real-world applications, due to the sparse terms in the text and the high-dimensionality representation. Most topic modeling methods require the number of topics to be defined beforehand. Similarly, methods based on the Dirichlet Multinomial Mixture (DMM) require the maximum possible number of topics to be specified before execution, which is hard to determine due to topic uncertainty and noise in the dataset. Hence, a new approach called the Topic Clustering algorithm based on Levenshtein Distance (TCLD) is introduced in this paper. TCLD combines DMM models and a fuzzy matching algorithm to address two key challenges in topic modeling: (a) the outlier problem in topic modeling methods, and (b) the problem of determining the optimal number of topics. TCLD uses the initial clustered topics generated by DMM models and then evaluates the semantic relationships between documents using Levenshtein Distance. Subsequently, it determines whether to keep a document in the same cluster, relocate it to another cluster, or mark it as an outlier. The results demonstrate the efficiency of the proposed approach across six English benchmark datasets, in comparison to seven topic modeling approaches, with 83% improvement in purity and 67% enhancement in Normalized Mutual Information (NMI) across all datasets. The proposed method was also applied to a collected set of Arabic tweets; according to human inspection, only 12% of the Arabic short texts were incorrectly clustered.
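For reference, the edit distance at the heart of TCLD can be computed with the classic dynamic program; below is a plain implementation (how TCLD then thresholds these distances to relocate documents or flag outliers follows the paper's procedure, which is not reproduced here):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```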
... Topic modeling, a popular text-mining approach, can analyze millions of documents in minutes (Karami 2020). LDA is a statistical distribution-based topic model (Lu et al., 2011). LDA presupposes the exchangeability of words and documents in a bag-of-words corpus. ...
Article
The recent advent of ChatGPT has stirred substantial attention and debates, potentially altering the dynamics across various industries, notably marketing. This pioneering study delves into the public reactions and applications of ChatGPT within marketing realms. Leveraging a text-mining methodology, a corpus of over 600,000 tweets harvested before and after ChatGPT's launch, from January 2021 to April 2023, was scrutinized to gauge public sentiment towards AI-incorporated tools in marketing and to unearth the predominant themes within public discourse. Initial findings unveiled a buoyant public sentiment towards AI-facilitated tools, which, however, ebbed in January 2023, driven by apprehensions regarding AI technology's limitations and potential perils. Subsequent months witnessed a rebound in sentiment, stabilizing above the positive threshold, as Twitter users increasingly acknowledged ChatGPT's prospective merits for employment and daily life. However, compared to the period before the introduction of ChatGPT, there has been a decline in the general public's sentiment towards AI in marketing. Furthermore, the analysis discerned a convergence in the core topics broached by the public concerning AI's and ChatGPT's ramifications on marketing. While the automation of mundane tasks and heightened customer experience were lauded, trepidations surrounding job displacement and the ethical quandaries of supplanting human labor with machines surfaced. This exposition recommends that enterprises meticulously assess the prospective impact of AI on their personnel, advocating for the judicious and ethical deployment of such emergent technologies.
... Based on the available research, the LDA model has proven superior to the pLSA model in some respects. First, the LDA approach can mitigate overfitting problems and compute scalable, fine-grained, low-dimensional semantic representations [21]. Second, LDA better captures the exchangeability of words and documents in the mixture model [22]. ...
Article
Full-text available
We quantify the impact of technological innovation factors on university patent transferability, accurately identify transferable patents, and address the lack of interpretability in existing patent transferability models by applying the latent Dirichlet allocation (LDA) model to conduct text mining and feature extraction on abstracts of university patents in the field of artificial intelligence. We then construct a patent transferability fusion index system that includes technological innovation features and quality features. Four typical machine learning algorithms, namely support vector machine (SVM), random forest (RF), artificial neural network (ANN), and extreme gradient boosting (XGBoost) were used to predict university patent transferability. We use SHapley Additive exPlanations (SHAP) to explore feature importance and interactions based on the model with the strongest performance. Our results show that (1) XGBoost outperforms the other algorithms in predicting university patent transferability; (2) fusion indicators can effectively improve prediction performance with respect to university patent transferability; (3) the importance of technological innovation features generated with XGBoost is generally high; and (4) the impact of both technology innovation and patent quality features on university patent transferability is nonlinear and there are significant positive interaction effects between them.
... As a text mining technique, LDA [34] has been widely used as a probabilistic model in the analysis of a corpus of documents [61]. During the modeling process in LDA, this technique assigns high probabilities to similar documents and members within the corpus [62]. ...
Article
Full-text available
Customer Relationship Management (CRM) is a method of management that aims to establish, develop, and improve relationships with targeted customers in order to maximize corporate profitability and customer value. There are many CRM systems on the market. These systems are developed based on a combination of business requirements, customer needs, and industry best practices. The impact of CRM systems on customer satisfaction and competitive advantage, as well as their tangible and intangible benefits, has been widely investigated in previous studies. However, there is a lack of studies assessing the quality dimensions of these systems against an organization's CRM strategy. This study investigates customer satisfaction with CRM systems through online reviews. We collected 5172 online customer reviews of 8 CRM systems from the Google Play store platform. The satisfaction factors were extracted using Latent Dirichlet Allocation (LDA) and grouped into three dimensions: information quality, system quality, and service quality. Data segmentation was performed using Learning Vector Quantization (LVQ), and feature selection was performed with the entropy-weight approach. We then used the Adaptive Neuro-Fuzzy Inference System (ANFIS), a hybrid of fuzzy logic and neural networks, to assess the relationship between these dimensions and customer satisfaction. The results are discussed and research implications are provided.
... Due to these advantages, LDA stands as a highly favored and widely accepted algorithm for conducting semantic content analysis on extensive textual corpora. Its systematic approach and versatility have made it a staple choice in numerous research endeavors [53], [54]. ...
Article
Full-text available
Gamification holds significant importance as an efficacious means to motivate individuals, stimulate their engagement, and foster desired behaviors. There is an increasing interest among researchers in exploring the domain of gamification. Consequently, it becomes crucial to identify specific research trends within this field. This study employs a comprehensive analysis of 4743 articles sourced from the Scopus database, utilizing the topic modeling approach, with the objective of discerning research patterns and trends within the gamification domain. The findings revealed the existence of thirteen distinct topics within the field. Notably, "Health training," "Enhancing learning with technology," and "Game design framework" emerged as the most prominent topics, based on their frequency of research publications and popularity. This study serves as a valuable resource for researchers and practitioners seeking to stay abreast of the latest advancements in gamification. The identified issues through topic modeling can be employed to identify gaps in current research and potential directions for future research endeavors.
... Topic models have been widely used to identify human-interpretable topics and learn text representations, which have been applied to various tasks in Natural Language Processing (NLP) such as information retrieval (Lu et al., 2011), summarization, and semantic similarity detection (Peinelt et al., 2020). A typical topic model is based on latent Dirichlet allocation (LDA) (Blei et al., 2003) and Bayesian inference. ...
... Yi et al. [52] conducted a comparative study of various topic modeling methods, including LDA, PLSA, and LSI, and found that LDA generally performed best. Similarly, Lu et al. [53] compared LDA and PLSA in an empirical study, and their results also favored LDA for most tasks. Atreya and Elkan [54] demonstrated the limitations of Latent Semantic Indexing (LSI) for TREC collections and proposed a new retrieval model that utilizes word co-occurrence statistics to estimate document similarity. ...
Article
Full-text available
This paper provides an extensive and thorough overview of the models and techniques utilized in the first and second stages of the typical information retrieval processing chain. Our discussion encompasses the current state-of-the-art models, covering a wide range of methods and approaches in the field of information retrieval. We delve into the historical development of these models, analyze the key advancements and breakthroughs, and address the challenges and limitations faced by researchers and practitioners in the domain. By offering a comprehensive understanding of the field, this survey is a valuable resource for researchers, practitioners, and newcomers to the information retrieval domain, fostering knowledge growth, innovation, and the development of novel ideas and techniques.
... The main advantage of LDA is the ability to generate new documents. LDA is also considered to be more robust to overfitting, although there is no definitive clarity on this issue (Lu, 2011). Most topic models today are based on LDA. ...
... It was not the first method for topic modelling, but since the introduction of LDA most models are adaptations of it [9]. Examples of recent advances in topic modelling include a variety of extensions of LDA such as neural models that were developed to facilitate LDA's robustness in retrieval tasks [6,17,19,30,33]. ...
Conference Paper
Full-text available
Topic modelling is an approach to generation of descriptions of document collections as a set of topics where each has a distinct theme and documents are a blend of topics. It has been applied to retrieval in a range of ways, but there has been little prior work on measurement of whether the topics are descriptive in this context. Moreover, existing methods for assessment of topic quality do not consider how well individual documents are described. To address this issue we propose a new measure of topic quality, which we call specificity; the basis of this measure is the extent to which individual documents are described by a limited number of topics. We also propose a new experimental protocol for validating topic-quality measures, a 'noise dial' that quantifies the extent to which the measure's scores are altered as the topics are degraded by addition of noise. The principle of the mechanism is that a meaningful measure should produce low scores if the 'topics' are essentially random. We show that specificity is at least as effective as existing measures of topic quality and does not require external resources. While other measures relate only to topics, not to documents, we further show that specificity correlates to the extent to which topic models are informative in the retrieval process.
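The excerpt does not give specificity's formula; one plausible proxy for "documents described by a limited number of topics" is the average topic mass captured by each document's top-m topics. The sketch below is an illustrative assumption, not the authors' definition:

```python
import numpy as np

def specificity_proxy(theta, m=3):
    """Mean share of each document's topic mass held by its top-m topics.
    High values mean documents are described by few topics.
    NOTE: an illustrative proxy, not the measure defined in the paper."""
    top_m = np.sort(theta, axis=1)[:, -m:]
    return float(top_m.sum(axis=1).mean())

rng = np.random.default_rng(0)
peaked = rng.dirichlet(np.full(20, 0.1), size=100)   # concentrated mixtures
flat = rng.dirichlet(np.full(20, 10.0), size=100)    # near-uniform mixtures
print(specificity_proxy(peaked), specificity_proxy(flat))
```

A noise dial in this setting would progressively randomize the model and check that the score degrades accordingly.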
... Topic models are often used by researchers to improve text representation and indexing in first-stage retrieval systems [1,97,100,101]. For example, [95] introduces the concept of a generalized vector space model. More recently, various topic models have been applied to first-stage retrieval. ...
Conference Paper
In this paper, first-stage retrieval technology is studied from four aspects: the development background, the frontier technology, the current challenges, and the future directions. Our contribution consists of two main parts. On the one hand, this paper reviews some retrieval techniques proposed by researchers and draws targeted conclusions through comparative analysis. On the other hand, different research directions are discussed, and the impact of the combination of different techniques on first-stage retrieval is studied and compared. In this way, this survey provides a comprehensive overview of the field and will hopefully be used by researchers and practitioners in the first-stage retrieval domain, inspiring new ideas and further developments.
... Probabilistic Latent Semantic Analysis (pLSA) was proposed to address the representational limitations of LSA by substituting Singular Value Decomposition (SVD) with a probabilistic model [13]. It represents every entry in the TF-IDF matrix using a probability. ...
Article
Full-text available
Topic modeling is a powerful technique for uncovering hidden patterns in large document collections. It can identify themes that are highly connected and lead to a certain region while accounting for temporal and spatial complexity. In addition, sentiment analysis can determine the sentiments of media articles on various issues. This study proposes a two-stage natural language processing-based model that utilizes Latent Dirichlet Allocation to identify critical topics related to each type of legal case or judgment, and the Valence Aware Dictionary and sEntiment Reasoner (VADER) algorithm to assess people's sentiments on those topics. By applying these strategies, this research aims to gauge public perception of controversial legal issues. This study is the first of its kind to use topic modeling and sentiment analysis on Indian legal documents and paves the way for a better understanding of legal documents.
... [55] compares the performance of various topic models for information retrieval, including LDA and PLSA. [56] provides a comparison of the performance of PLSA and LDA on several benchmark datasets. However, not all studies have found topic models to be effective for information retrieval. ...
Preprint
In this paper, we provide a detailed overview of the models used for information retrieval in the first and second stages of the typical processing chain. We discuss the current state-of-the-art models, including methods based on terms, semantic retrieval, and neural approaches. Additionally, we delve into the key topics related to the learning process of these models. This way, this survey offers a comprehensive understanding of the field and is of interest for researchers and practitioners entering or working in the information retrieval domain.
... Depending on the generative procedure and statistical assumptions employed in the topic modeling algorithm, both P(w|a_k) and P(a_k|r) may change. Among the various topic modeling algorithms, pLSA (Hofmann, 2001) and LDA (Blei et al., 2003) are widely utilized in the literature (Lu et al., 2011). Therefore, we use these algorithms for identifying the topics and the associated keywords. ...
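With a fitted LDA model, both distributions are directly available; a Gensim sketch (toy corpus; here each aspect a_k is simply a topic):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Placeholder reviews; in the study these would be OCR sentences.
texts = [["seat", "comfort", "legroom"], ["crew", "service", "friendly"]] * 40
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=4, random_state=0)

# P(w | a_k): one row per aspect/topic over the whole vocabulary.
topic_word = lda.get_topics()          # shape: (num_topics, vocab_size)

# P(a_k | r): topic proportions for a single review r.
doc_topics = lda.get_document_topics(corpus[0], minimum_probability=0.0)
```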
Article
Full-text available
Driven by the fierce competition in the airline industry, carriers strive to increase their customer satisfaction by understanding their expectations and tailoring their service offerings. Due to the explosive growth of social media usage, airlines have the opportunity to capitalize on the abundantly available online customer reviews (OCR) to extract key insights about their services and competitors. However, the analysis of such unstructured textual data is complex and time-consuming. This research aims to automatically and efficiently extract airline-specific intelligence (i.e., passenger-perceived strengths and weaknesses) from OCR. Topic modeling algorithms are employed to discover the prominent service quality aspects discussed in the OCR. Likewise, sentiment analysis methods and collocation analysis are used to classify review sentence sentiment and ascertain the major reasons for passenger satisfaction/dissatisfaction, respectively. Subsequently, an ensemble-assisted topic model (EA-TM) and sentiment analyzer (E-SA) is proposed to classify each review sentence to the most representative aspect and sentiment. A case study involving 398,571 airline review sentences of a US-based target carrier and four of its competitors is used to validate the proposed framework. The proposed EA-TM and E-SA achieved 17–23% and 9–20% higher classification accuracy over individual benchmark models, respectively. The results reveal 11 different aspects of airline service quality from the OCR, airline-specific sentiment summary towards each aspect, and root causes for passenger satisfaction/dissatisfaction for each identified topic. Finally, several theoretical and managerial implications for improving airline services are derived based on the results.
... They focused on which component most closely corresponds to the terms describing its function in the issue report. Therefore, they constructed a semantic analysis model with issue reports as documents and components as categories, as Lu et al. [25] proposed. The experiment reported top-k results on the title-and-description dataset, consisting of 6,000 issue reports for ten components. ...
Article
Full-text available
Various issues or bugs are reported during software development. It takes considerable effort, time, and cost for software developers to triage these issues manually. Many previous studies have proposed various methods to automate the triage process by predicting components using word-based language models. However, these methods still suffer from unsatisfactory performance due to their structural limitations and ignorance of word context. In this paper, we propose a novel technique based on pretrained language models that aims to predict the component of an issue report. Our approach fine-tunes pretrained language models to conduct multilabel classification. The proposed approach outperforms the previous state-of-the-art method by more than 30% with respect to recall at k on all the datasets considered in our experiment. This improvement suggests that fine-tuned pretrained language models can help us predict issue components effectively.
... Those methods transform unstructured text into numerical data, so that large-scale text data can be structured, classified, or clustered automatically. Although such representations are feasible for tasks such as document summarization [8,9], document classification [10][11][12] and clustering [13], they suffer from dimensional sparsity, which leads to high computational cost, and also miss contextual information in the text sequences [14]. ...
Article
Full-text available
Neural networks, primarily recurrent and convolutional neural networks, have proven successful in text classification. However, convolutional models can be limited when classification tasks are determined by long-range semantic dependency, and while recurrent models can capture long-range dependency, their sequential architecture can constrain training speed. Meanwhile, traditional networks encode the entire document in a single pass, which omits the hierarchical structure of the document. To address the above issues, this study presents T-HMAN, a Topic-aware Hierarchical Multiple Attention Network for text classification. A multi-head self-attention mechanism coupled with convolutional filters is developed to capture long-range dependency by integrating the convolution features from each attention head. Meanwhile, T-HMAN combines topic distributions generated by Latent Dirichlet Allocation (LDA) with sentence-level and document-level inputs, respectively, in a hierarchical architecture. The proposed model surpasses the accuracies of the current state-of-the-art hierarchical models on five publicly accessible datasets. The ablation study demonstrates that the involvement of multiple attention mechanisms brings significant improvement. The current topic distributions are fixed vectors generated by LDA; in future work, the topic distributions will be parameterized and updated simultaneously with the model weights.
... Analyzing over one million tweets would have required a substantial amount of human effort. Computationally, LDA performs the process exponentially faster while addressing issues of sparsity related to text mining [77]. ...
Article
Full-text available
Natural language processing techniques have increased the volume and variety of text data that can be analyzed. The aim of this study was to identify the positive and negative topical sentiments among diet, diabetes, exercise, and obesity tweets. Using a sequential explanatory mixed-method design for our analytical framework, we analyzed a data corpus of 1.7 million diet, diabetes, exercise, and obesity (DDEO)-related tweets collected over 12 months. Sentiment analysis and topic modeling were used to analyze the data. The results show that overall, 29% of the tweets were positive, and 17% were negative. Using sentiment analysis and latent Dirichlet allocation (LDA) topic modeling, we analyzed 800 positive and negative DDEO topics. From the 800 LDA topics—after the qualitative and computational removal of incoherent topics—473 topics were characterized as coherent. Obesity was the only query health topic with a higher percentage of negative tweets. The use of social media by public health practitioners should focus not only on the dissemination of health information based on the topics discovered but also consider what they can do for the health consumer as a result of the interaction in digital spaces such as social media. Future studies will benefit from using multiclass sentiment analysis methods associated with other novel topic modeling approaches.
... LDA models the original text using the bag-of-words model, obtaining a probability distribution of text-implicit topics and characterizing the text with a vector of text in the implicit topic space, which produces better results for long texts. Lu et al. [6][7][8] improve the LDA model to enhance the topic model's textual representation. Muhammad et al. [9] develop sentiment shift metrics based on contextual information such as negation, great celebration, and decay, as well as text features such as unigrams and bigrams. ...
Article
Full-text available
Emotional tracking on time-varying virtual space communication aims to identify sentiments and opinions expressed in a piece of user-generated content. However, the existing research mainly focuses on the user’s single post, despite the fact that social network data are sequential. In this article, we propose a sentiment analysis model based on time series prediction in order to understand and master the chronological evolution of the user’s point of view. Specifically, with the help of a domain-knowledge-enhanced pre-trained encoder, the model embeds tokens for each moment in the text sequence. We then propose an attention-based temporal prediction model to extract rich timing information from historical posting records, which improves the prediction of the user’s current state and personalizes the analysis of user’s sentiment changes in social networks. The experiments show that the proposed model improves on four kinds of sentiment tasks and significantly outperforms the strong baseline.
... Every topic is described as a multinomial distribution over the |V| words in the vocabulary. The texts are created by sampling a mixture of these latent topics and then sampling words from that mixture [31,32]. ...
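That generative story is short enough to write out directly; a NumPy sketch (vocabulary size, topic count, and priors are arbitrary placeholder values):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, alpha, beta = 1000, 20, 0.1, 0.01         # |V|, topics, priors

phi = rng.dirichlet(np.full(V, beta), size=K)   # K multinomials over the |V| words

def generate_document(length: int) -> list[int]:
    theta = rng.dirichlet(np.full(K, alpha))    # the document's topic mixture
    words = []
    for _ in range(length):
        z = rng.choice(K, p=theta)              # sample a latent topic...
        words.append(rng.choice(V, p=phi[z]))   # ...then a word from that topic
    return words

doc = generate_document(50)
```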
Article
Full-text available
Text mining, also known as text analysis, is the process of converting unstructured text data into meaningful and functional information. Text mining uses different AI technologies to automate data processing and generate valuable insights, allowing enterprises to make data-based decisions. It enables the user to extract important content from text data sets and encourages machine-learning applications in research areas such as medical and pharmaceutical innovation. Apart from this, text analysis converts inaccessible data into a structured format that can be used for further analysis, emphasizing facts and relationships in large data sets; this information is extracted and converted into structured data for visualization, analysis, and integration, and refined using machine-learning methods. Like most things related to Natural Language Processing, text mining can seem like a difficult concept to understand, but it does not have to be. This research article goes through the basics of text mining, clarifies its different methods and techniques, and makes it easier to understand how it works. We implemented Latent Dirichlet Allocation to mine the data set; it works properly and will be extended with further data mining techniques in future work.
... Our models use mixtures of Gaussians to model local patches, and the codebook is learned online. LDA (Lu, Mei, and Zhai 2011) can be used for clustering by treating each topic as a cluster. An image is assigned to cluster x if x = argmax_k θ_k, where θ is the topic proportion vector of the image. ...
Article
Image clustering and visual codebook learning are two fundamental problems in computer vision and they are tightly related. On one hand, a good codebook can generate effective feature representations which largely affect clustering performance. On the other hand, class labels obtained from image clustering can serve as supervised information to guide codebook learning. Traditionally, these two processes are conducted separately and their correlation is generally ignored. In this paper, we propose a Double Layer Gaussian Mixture Model (DLGMM) to simultaneously perform image clustering and codebook learning. In DLGMM, the two tasks are seamlessly coupled and can mutually promote each other. Cluster labels and codebook are jointly estimated to achieve the overall best performance. To incorporate the spatial coherence between neighboring visual patches, we propose a Spatially Coherent DLGMM which uses a Markov Random Field to encourage neighboring patches to share the same visual word label. We use variational inference to approximate the posterior of latent variables and learn model parameters. Experiments on two datasets demonstrate the effectiveness of the two models.
... In fact, during the computation of the word-document matrix, all of the textual elements are randomly mixed to carry out the required statistical processing and analysis. As such, strategies such as Probabilistic Latent Semantic Analysis (PLSA) and LDA are based on the assumption of the exchangeability of words and textual instances [12]. ...
Article
Full-text available
Education quality has become an important issue and has received considerable attention around the world, especially due to its relevant repercussions on the socio-economic development of society. In recent years, many nations have realized the need for a highly skilled workforce to thrive in the emerging knowledge-based economy. They have consequently adopted strategies to identify the lines of action to improve the education quality. In response to the government's efforts to improve the education quality in Colombia, this study examines the current perceptions of the education system from the perspective of key local stakeholders. Therefore, we used a survey that contained open-ended questions to collect information about the limitations and difficulties of the education process for several groups of participants. The collected answers were categorized into a variety of topics using a Latent Dirichlet Allocation based model. Consequently, the students', teachers' and parents' answers were analyzed separately to obtain a general landscape of the perceptions of the education system. Evaluation metrics, such as topic coherence, were quantitatively analyzed to assess the modelling performance. In addition, a methodology for hyper-parameter setting and final topic labelling is presented. The results suggest that topic modelling strategies are a viable alternative to identify strategic lines of action and to obtain a macro-perspective of the perceptions of the education system.
... Few studies compare topic modeling algorithms according to their actual accuracy, and their findings are mixed. The authors of [68] compare the task performance of PLSA and LDA using two datasets. ...
Article
Full-text available
Topic modeling is a popular technique for exploring large document collections. It has proven useful for this task, but its application poses a number of challenges. First, the comparison of available algorithms is anything but simple, as researchers use many different datasets and criteria for their evaluation. A second challenge is the choice of a suitable metric for evaluating the calculated results. The metrics used so far provide a mixed picture, making it difficult to verify the accuracy of topic modeling outputs. Altogether, the choice of an appropriate algorithm and the evaluation of the results remain unresolved issues. Although many studies have reported promising performance by various topic models, prior research has not yet systematically investigated the validity of the outcomes in a comprehensive manner, that is, using more than a small number of the available algorithms and metrics. Consequently, our study has two main objectives. First, we compare all commonly used, non-application-specific topic modeling algorithms and assess their relative performance. The comparison is made against a known clustering and thus enables an unbiased evaluation of results. Our findings show a clear ranking of the algorithms in terms of accuracy. Secondly, we analyze the relationship between existing metrics and the known clustering, and thus objectively determine under what conditions these algorithms may be utilized effectively. This way, we enable readers to gain a deeper understanding of the performance of topic modeling techniques and the interplay of performance and evaluation metrics.
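Evaluating a topic-model clustering against a known reference clustering, as this study does, typically relies on measures such as purity and NMI; a scikit-learn sketch (label arrays are placeholders):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred) -> float:
    """Fraction of documents falling in the majority true class of their cluster."""
    cm = contingency_matrix(labels_true, labels_pred)
    return cm.max(axis=0).sum() / cm.sum()

labels_true = np.array([0, 0, 1, 1, 2, 2])   # known clustering (placeholder)
labels_pred = np.array([0, 0, 1, 2, 2, 2])   # topic-model output (placeholder)
print(purity(labels_true, labels_pred),
      normalized_mutual_info_score(labels_true, labels_pred))
```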
... To acquire ideas for social and policy issues, it effectively extracts interpretable topics from vast collections of data, such as unstructured documents, reports, survey results, discussions, and statistics. Compared to other topic modeling techniques, LDA yields readily interpretable results and mitigates overfitting, which is advantageous in deriving multiple topics from vast amounts of unstructured data [118]. ...
Article
Full-text available
The world is now strengthening its Information and Communication Technology (ICT) capabilities to secure economic growth and national competitiveness. The role of ICT is important for problems like COVID-19: ICT-based innovation is effective in responding to problems in industry, the economy, and society. However, we need to understand, beyond the perspective of performance or investment, that the use and performance of ICT are promoted when each country's ICT-related environment, policies, governance, and regulations are effective. We need to share sustainable ICT experiences, successes, and challenges to solve complex problems and reorganize policies. This study proposes a text mining methodology from a future-oriented perspective to extract semantic system patterns from International Telecommunication Union (ITU) professional reports. In the text extracted from the reports, we found new relationship patterns and potential topics. The research results provide policymakers with insights from diverse perspectives in the search for successful ICT strategies.
... Conventional topic models, such as PLSA and LDA, are among the most popular techniques for discovering topic words within a document (Zan et al., 2007; Lu et al., 2011). Since MKEs are short texts, applying these conventional topic models directly to them usually does not achieve the expected performance. ...
Article
Interest in assessing research impacts is increasing due to its importance for informing actions and funding allocation decisions. The level of innovation (also called "innovation degree" in this article), one of the most essential factors affecting scientific literature's impact, has also received increasing attention. However, current studies mainly focus on the overall innovation degree of scientific literature at the macro level, while ignoring the innovation degree of a specific knowledge element (KE), such as the method knowledge element (MKE). A macro-level view makes it difficult to identify which part of the scientific literature contains the innovations. To bridge this gap, a more fine-grained evaluation of academic papers is urgently needed. A fine-grained evaluation method can ensure the quality of a paper before it is published and identify useful knowledge content in a paper for academic users. Different KEs can be used to perform fine-grained evaluation, but MKEs are usually considered among the most essential of all KEs. Therefore, this study proposes a framework to measure the innovation degree of method knowledge elements (MIDMKE) in scientific literature. In this framework, we first extract the MKEs using a rule-based approach and generate a cloud drop for each MKE using the biterm topic model (BTM). The generated cloud drop is then used to create a method knowledge cloud (MKC) for each MKE. Finally, we calculate the innovation score of an MKE based on the similarity between it and other MKEs of its type. Our empirical study on a China National Knowledge Infrastructure (CNKI) academic literature dataset shows the proposed approach can measure the innovation of MKEs in scientific literature effectively. Our proposed method is useful for both reviewers and funding agencies to assess the quality of academic papers. The dataset, the code implementing the algorithms, and the complete experiment results will be released at: https://github.com/haihua0913/midmke.
... To decide on the number of topics in the news articles, this study set the number of topics to 9 after testing every value between 5 and 15, and conducted topic modeling with iterations = 1000, α = 0.1, and β = 0.01, based on the studies by Zhao et al. (2015) and Lu et al. (2011). The topics of the news articles and café posts are shown in Tables 2 and 3. ...
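Those settings map directly onto Gensim's LdaModel arguments; a sketch with a placeholder corpus (Gensim names the β prior eta):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["policy", "citizen", "service"], ["news", "article", "topic"]] * 30
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=9,      # K chosen after testing values from 5 to 15
               alpha=0.1,         # document-topic prior
               eta=0.01,          # topic-word prior (the beta above)
               iterations=1000,   # inference iterations
               random_state=0)
```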
Article
Full-text available
Purpose: This study determines the possibility of public service innovation to meet the rapid changes in information technology (IT) and the need for new governance by analyzing three cases in South Korea. Methodology: The Smart Governance-Decision Support Systems (SG-DSS) in this study is a new form that guarantees the voluntary participation of citizens by applying IT to governance. SG-DSS supports the demand response that fulfills universal values and decisions about priorities by collecting citizens' needs. It also encourages citizens or stakeholders to participate in establishing implementation plans that are more specific and fit for reality, giving legitimacy to public service policies and developing them into a driving force. Findings: The three case studies on Korean public policies show how public opinions are reflected in public service policies. Therefore, the findings of this study could lay the foundation for customized public services based on intelligent citizen participation by overcoming the current limitations. Unique contribution to theory, practice and policy: The core value of smart governance is to apply IT innovations such as big data and AI to public services. Furthermore, advanced technology enables the collection and application of actual public opinions, thereby improving public services to be more objective and efficient.
... In LDA, each document is represented by a bag of words, where the order of words is ignored. We used the implementation of LDA from Gensim for the Python programming language with standard parameter settings [23,24], and tested it with 50, 100, 150, and 200 topics. Third, we used Paragraph Vector, or Doc2Vec, to represent a trial registration as a vector representation. ...
Article
Full-text available
Background: Clinical trial registries can be used as sources of clinical evidence for systematic review synthesis and updating. Our aim was to evaluate methods for identifying clinical trial registrations that should be screened for inclusion in updates of published systematic reviews.
Methods: A set of 4644 clinical trial registrations (ClinicalTrials.gov) included in 1089 systematic reviews (PubMed) were used to evaluate two methods (document similarity and hierarchical clustering) and three representations (L2-normalised TF-IDF, Latent Dirichlet Allocation, and Doc2Vec) for ranking 163,501 completed clinical trials by relevance. Clinical trial registrations were ranked for each systematic review using seeding clinical trials, simulating how new relevant clinical trials could be automatically identified for an update. Performance was measured by the number of clinical trials that needed to be screened to identify all relevant clinical trials.
Results: Using the document similarity method with TF-IDF feature representation and Euclidean distance metric, all relevant clinical trials for half of the systematic reviews were identified after screening 99 trials (IQR 19 to 491). The best-performing hierarchical clustering used Ward agglomerative clustering (with TF-IDF representation and Euclidean distance) and needed to screen 501 clinical trials (IQR 43 to 4363) to achieve the same result.
Conclusion: An evaluation using a large set of mined links between published systematic reviews and clinical trial registrations showed that document similarity outperformed hierarchical clustering for identifying relevant clinical trials to include in systematic review updates.
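Under L2-normalized TF-IDF, ranking by Euclidean distance coincides with ranking by cosine similarity, which makes the best-performing method above simple to sketch (scikit-learn; the strings are placeholders for registration and seed-trial text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

registrations = ["trial of drug A in adults", "pediatric vaccine study",
                 "drug A long-term follow-up"]   # placeholder corpus
seed = ["trial of drug A"]                       # seed trial from the review

vec = TfidfVectorizer(norm="l2")                 # L2-normalised TF-IDF
X = vec.fit_transform(registrations)
q = vec.transform(seed)

order = euclidean_distances(X, q).ravel().argsort()  # screen in this order
```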
... We used Latent Dirichlet Allocation (LDA), an unsupervised topic modeling technique for identifying latent topics in a collection of documents (corpus). LDA is regarded as among the most effective topic modeling techniques and has been widely used for both confirmatory and exploratory purposes (Lu et al. 2011; Li et al. 2016). We used a bag-of-words (BOW) approach to provide input to LDA for generating latent topics in the document corpus, such that every topic consists of word combinations that capture the context of their usage and each document can then be mapped to multiple derived topics (Blei et al. 2003). ...
Article
Amid the flood of fake news on Coronavirus disease of 2019 (COVID-19), now referred to as the COVID-19 infodemic, it is critical to understand the nature and characteristics of the infodemic, since it not only results in altered individual perception and behavior shifts, such as irrational preventative actions, but also presents an imminent threat to public safety and health. In this study, we build on First Amendment theory, integrate text and network analytics, and deploy a three-pronged approach to develop a deeper understanding of the COVID-19 infodemic. The first prong uses Latent Dirichlet Allocation (LDA) to identify topics and key themes that emerge in COVID-19 fake and real news. The second prong compares and contrasts the emotions in fake and real news. The third prong uses network analytics to understand various network-oriented characteristics embedded in COVID-19 real and fake news, such as PageRank, betweenness centrality, eccentricity, and closeness centrality. This study carries important implications for building next-generation trustworthy technology by providing strong guidance for the design and development of fake news detection and recommendation systems for coping with the COVID-19 infodemic. Additionally, based on our findings, we provide actionable, system-focused guidelines for dealing with immediate and long-term threats from the COVID-19 infodemic.
Article
Global climate change has caused various natural disasters, which have resulted in serious damage to society. Therefore, there has been growing interest in utilizing eco-friendly energy sources such as hydrogen fuel, and hydrogen vehicles and infrastructure have been studied extensively. However, the research trends of hydrogen refueling stations have not been systematically analyzed using text mining on domestic research articles. The keyword network and research topics were analyzed based on Korea Citation Index (KCI) data from the past 10 years. The analysis revealed that "hydrogen refueling station," "fuel cell," and "charging station" are new research keywords. Furthermore, topics such as "hydrogen storage," "hydrogen and electric vehicle," and "safety in hydrogen refueling station" are becoming increasingly popular. These quantitative analysis results provide insight into the development of hydrogen infrastructure and research policy.
... Probabilistic latent semantic indexing-based and LDA-based topic models (e.g., Lu et al., 2011; Luo et al., 2020) are typical examples. Neural network-based pre-trained language models, including BERT and ELMo, are often used to learn universal language representations based on syntactic features of the language (Devlin et al., 2018; Li et al., 2020). ...
Article
We proposed a new rule-based text analysis method to effectively summarize and transform unstructured user-generated content (online customer reviews) into an analysable form for tourism and hospitality research. To differentiate this method, we developed the Disintegrating, Summarizing, Straining, Bagging, Upcycling, and Scoring (DiSSBUS) algorithm, which addresses the following problems in previous approaches: (1) false identification of irrelevant aspect terms, (2) improper handling of multiple aspects and sentiments within a text unit, and (3) data sparsity. The algorithm's distinctive advantage is that it decomposes a single review into a set of bi-terms related to aspects pre-specified from domain knowledge. It can therefore identify customer opinions on specific aspects, which makes it possible to extract variables of interest from online reviews. To evaluate the performance of our confirmatory aspect-level opinion-mining algorithm, we applied it to customer reviews of restaurants in Hawaii. The findings from the empirical test validated its effectiveness.
Article
Full-text available
Probabilistic Latent Semantic Analysis (PLSA) is a fundamental text analysis technique that models each word in a document as a sample from a mixture of topics. PLSA is the precursor of probabilistic topic models including Latent Dirichlet Allocation (LDA). PLSA, LDA and their numerous extensions have been successfully applied to many text mining and retrieval tasks. One important extension of LDA is supervised LDA (sLDA), which distinguishes itself from most topic models in that it is supervised. However, to the best of our knowledge, no prior work extends PLSA in the way sLDA extends LDA, by jointly modeling the contents and the responses of documents. In this paper, we propose supervised PLSA (sPLSA), which can efficiently infer latent topics and their factorized response values from the contents and the responses of documents. The major challenge lies in estimating a document's topic distribution, a constrained probability dictated by both the content and the response of the document. To tackle this challenge, we introduce an auxiliary variable to transform the constrained optimization problem into an unconstrained one. This allows us to derive an efficient Expectation-Maximization (EM) algorithm for parameter estimation. Compared to sLDA, sPLSA converges much faster and requires less hyperparameter tuning, while performing similarly on topic modeling and better in response factorization. This makes sPLSA an appealing choice for latent response analysis such as ranking latent topics by their factorized response values. We apply the proposed sPLSA model to analyze the controversy of bills from the United States Congress. We demonstrate the effectiveness of our model by identifying contentious legislative issues.
Article
Various studies have been conducted to minimize the property damage and casualties caused by fire accidents, which occur more than 40,000 times annually on average. Research papers published over the last 10 years in the Fire Science and Engineering journal (indexed in the Korea Citation Index) and the Fire Technology journal (indexed in the Science Citation Index Expanded) were analyzed using text-mining techniques. Similar research papers published in these two journals were explored, and significant differences in perceptions were identified. Recently, the proportion of studies on “Wildland-Urban-Interface (WUI) fire” and “Battery fire” published in the Fire Technology journal has increased, whereas research papers related to “Firefighter organization” are being actively published in the Fire Science and Engineering journal. A quantitative analysis of studies on fire incidents can provide significant information for developing new policies essential to reducing fire damage.
Chapter
The success of topic modeling algorithms depends on their ability to analyze, index and classify large text corpora. These algorithms can be divided into two groups: the first classifies a textual corpus according to its dominant topics, with LDA, LSA and PLSA the best-known techniques; the second extracts the relationships among the generated topics, as in HLDA, PAM and CTM. However, each algorithm in these groups is dedicated to a single task, and no technique makes it possible to carry out several analyses on a textual corpus at the same time. To cope with this problem, we propose a new technique based on LDA topic modeling that automatically classifies a large text corpus according to its relevant topics, discovers new topics (sub-topics) based on the extracted ones, and organizes the generated topics into a hierarchy in order to analyse the data more deeply. Experiments were conducted to measure the performance of our solution compared to existing techniques. The results obtained are more than satisfactory.
Article
We used Natural Language Processing (NLP) to assess topic diversity in all research articles (∼75,000) from eighteen water science and hydrology journals published between 1991 and 2019. We found that individual water science and hydrology research articles are becoming increasingly diverse in the sense that, on average, the number of topics represented in individual articles is increasing, which may be a sign of increasing interdisciplinarity. This is true even though the body of water science and hydrology literature as a whole is not becoming more topically diverse. Topics with the largest increases in popularity were Climate Change Impacts, Water Policy & Planning, and Pollutant Removal. Topics with the largest decreases in popularity were Stochastic Models and Numerical Models. At a journal level, Water Resources Research, Journal of Hydrology, and Hydrological Processes are the three most topically diverse journals among the corpus that we studied.
Chapter
Events that people are exposed to in the course of their lives have important effects on their quality of life, and events with significant effects on large parts of society are shared with the public through news texts. Keeping pace with the digital age, we address the problem of automatically detecting and tracking events in the news using natural language processing methods. An event-based news clustering approach is presented for organizing the data, which is necessary to extract meaningful information from the large volumes of news accumulating in online environments. The approach uses named entities to increase clustering performance and speed. Additionally, an event-based text clustering dataset was created by the researchers and contributed to the literature. Using the B-cubed evaluation metric on this test dataset, which consists of 930 different event groups and a total of 19,848 news articles, the event-based text clustering problem was solved with an F-score of over 85%.
Article
Topic models assert that documents are distributions over latent topics and latent topics are distributions over words. A nested document collection has documents nested inside a higher order structure such as articles nested in journals, podcasts within authors, or web pages nested in web sites. In a single collection of documents, topics are global or shared across all documents. For web pages nested in web sites, topic frequencies likely vary across web sites and within a web site, topic frequencies almost certainly vary from web page to web page. A hierarchical prior for topic frequencies models this hierarchical structure with a global topic distribution, web site topic distributions varying around the global topic distribution, and web page topic distributions varying around the web site topic distribution. Web pages in one United States local health department web site often contain local geographic and news topics not found on web pages of other local health department web sites. For web pages nested in web sites, some topics are likely local topics and unique to an individual web site. Regular topic models ignore the nesting structure and may identify local topics but cannot label those topics as local nor identify the corresponding web site owner. Explicitly modeling local topics identifies the owning web site and identifies the topic as local. In US health web site data, topic coverage is defined at the web site level after removing local topic words from pages. Hierarchical local topic models can be used to study how well health topics are covered.
Article
Full-text available
The aggregation of the same type of socio-economic activities in urban space generates urban functional zones, each with one main function (e.g., residential, educational or commercial); these zones are important parts of the city. With the development of deep learning in remote sensing, the accuracy of land-use decoding has greatly improved. However, even fine-resolution remote sensing imagery cannot directly capture economic and social information, and its long revisit cycle (low temporal resolution) is poorly suited to urban flooding, which often lasts only a few hours. Cities contain a large amount of "social sensing" data that records human socio-economic activities, and GIS as a discipline has strong socio-economic ties. We propose a new GeoSemantic2vec algorithm for urban function recognition based on recent advances in natural language processing (the BERT model), which utilizes the rich semantic information in urban POI data to characterize urban functions. Taking the Wuhan flooding event in summer 2020 as an example, we identified 84.55% of the flooding locations reported in social media. We also used the proposed algorithm to divide the main urban area of Wuhan into 8 types of urban functional zones (kappa coefficient 0.615) and construct a "City Portrait" of flooding locations. This paper summarizes existing research on urban function identification using natural language processing techniques and proposes an improved algorithm, which is of value for urban flood location detection and risk assessment.
Article
Topic modeling can be synergistically interrelated with document clustering. We present an innovative unsupervised approach to the interrelationship of topic modeling with document clustering. The devised approach exploits Bayesian generative modeling and posterior inference to seamlessly unify and jointly carry out the two tasks. Specifically, a Bayesian nonparametric model of text collections formulates a novel interrelationship of word-embedding topics with a Dirichlet process mixture of cluster components. The latter enables countably infinite clusters and permits the automatic inference of their actual number in a statistically principled manner. All latent clusters and topics under the foresaid model are inferred through collapsed Gibbs sampling and parameter estimation. An extensive empirical study of the presented approach is conducted on benchmark real-world corpora of text documents. The experimental results demonstrate its higher effectiveness in partitioning text collections and coherently discovering their semantics, compared to state-of-the-art competitors and tailored baselines. Computational efficiency is also examined under different conditions to provide an insightful analysis of scalability.
Conference Paper
Full-text available
Contextual text mining is concerned with extracting topical themes from a text collection with context information (e.g., time and location) and comparing/analyzing the variations of themes over different contexts. Since the topics covered in a document are usually related to the context of the document, analyzing topical themes within context can potentially reveal many interesting theme patterns. In this paper, we propose a new general probabilistic model for contextual text mining that can cover several existing models as special cases. Specifically, we extend the probabilistic latent semantic analysis (PLSA) model by introducing context variables to model the context of a document. The proposed mixture model, called contextual probabilistic latent semantic analysis (CPLSA) model, can be applied to many interesting mining tasks, such as temporal text mining, spatiotemporal text mining, author-topic analysis, and cross-collection comparative analysis. Empirical experiments show that the proposed mixture model can discover themes and their contextual variations effectively.
Conference Paper
Full-text available
In this paper, we define and study a novel text mining problem, which we refer to as Comparative Text Mining (CTM). Given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as summarize the similarity and differences of these collections along each common theme. This general problem subsumes many interesting applications, including business intelligence and opinion summarization. We propose a generative probabilistic mixture model for comparative text mining. The model simultaneously performs cross-collection clustering and within-collection clustering, and can be applied to an arbitrary set of comparable text collections. The model can be estimated efficiently using the Expectation-Maximization (EM) algorithm. We evaluate the model on two different text data sets (i.e., a news article data set and a laptop review data set), and compare it with a baseline clustering method also based on a mixture model. Experiment results show that the model is quite effective in discovering the latent common themes across collections and performs significantly better than our baseline mixture model.
Conference Paper
Full-text available
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
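To make the generative story concrete, here is a toy numpy simulation of the process just described (draw a topic mixture, then a topic per word slot, then a word per topic); the vocabulary, topic count, and hyperparameters are arbitrary illustrative values, not from the paper.

    # Toy simulation of the LDA generative process (illustrative values only).
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["gene", "cell", "ball", "team", "model", "data"]
    K, V, doc_len = 2, len(vocab), 8
    alpha, beta = 0.5, 0.1

    phi = rng.dirichlet([beta] * V, size=K)   # each topic: a distribution over words
    theta = rng.dirichlet([alpha] * K)        # one document's mixture over topics
    z = rng.choice(K, size=doc_len, p=theta)  # a topic assignment for each word slot
    words = [vocab[rng.choice(V, p=phi[k])] for k in z]
    print(words)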
Conference Paper
Full-text available
Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.
Conference Paper
Full-text available
Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for measuring semantic meaning in inferred topics. We back these measures with large-scale user studies, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood. Surprisingly, topic models which perform better on held-out likelihood may infer less semantically meaningful topics.
Conference Paper
Full-text available
Topic modeling has been a key problem for document analysis. One of the canonical approaches for topic modeling is Probabilistic Latent Semantic Indexing, which maximizes the joint probability of documents and terms in the corpus. The major disadvantage of PLSI is that it estimates the probability distribution of each document on the hidden topics independently and the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting. Latent Dirichlet Allocation (LDA) is proposed to overcome this problem by treating the probability distribution of each document over topics as a hidden random variable. Both of these two methods discover the hidden topics in the Euclidean space. However, there is no convincing evidence that the document space is Euclidean, or flat. Therefore, it is more natural and reasonable to assume that the document space is a manifold, either linear or nonlinear. In this paper, we consider the problem of topic modeling on intrinsic document manifold. Specifically, we propose a novel algorithm called Laplacian Probabilistic Latent Semantic Indexing (LapPLSI) for topic modeling. LapPLSI models the document space as a submanifold embedded in the ambient space and directly performs the topic modeling on this document manifold in question. We compare the proposed LapPLSI approach with PLSI and LDA on three text data sets. Experimental results show that LapPLSI provides better representation in the sense of semantic structure.
Conference Paper
Full-text available
Non-negative Matrix Factorization (NMF, [5]) and Probabilistic Latent Semantic Analysis (PLSA, [4]) have been successfully applied to a number of text analysis tasks such as document clustering. Despite their different inspirations, both methods are instances of multinomial PCA [1]. We further explore this relationship and first show that PLSA solves the problem of NMF with KL divergence, and then explore the implications of this relationship.
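The correspondence can be stated compactly: PLSA's factorization of the joint document-word distribution matches NMF under the generalised KL divergence, so a suitably normalised NMF-KL solution yields PLSA parameters and vice versa. In standard notation (ours, not quoted from the paper):

\[
P(d, w) = \sum_{k} P(k)\, P(d \mid k)\, P(w \mid k)
\quad\longleftrightarrow\quad
X \approx WH, \qquad
\min_{W, H \ge 0}\; \sum_{d, w} \Big( X_{dw} \log \frac{X_{dw}}{(WH)_{dw}} - X_{dw} + (WH)_{dw} \Big),
\]

where X is the (count-normalised) term-document matrix.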
Conference Paper
Full-text available
Latent Dirichlet Allocation (LDA) is a fully generative approach to language modelling which overcomes the inconsistent generative semantics of Probabilistic Latent Semantic Indexing (PLSI). This paper shows that PLSI is a maximum a posteriori estimated LDA model under a uniform Dirichlet prior, therefore the perceived shortcomings of PLSI can be resolved and elucidated within the LDA framework.
Conference Paper
Full-text available
In this paper, we formally define the problem of topic modeling with network structure (TMN). We propose a novel solution to this problem, which regularizes a statistical topic model with a harmonic regularizer based on a graph structure in the data. The proposed method combines topic modeling and social network analysis, and leverages the power of both statistical topic models and discrete regularization. The output of this model can summarize well topics in text, map a topic onto the network, and discover topical communities. With appropriate instantiations of the topic model and the graph-based regularizer, our model can be applied to a wide range of text mining problems such as author-topic analysis, community discovery, and spatial text mining. Empirical experiments on two data sets with different genres show that our approach is effective and outperforms both text-oriented methods and network-oriented methods alone. The proposed model is general; it can be applied to any text collections with a mixture of topics and an associated network structure.
Article
Full-text available
A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying "hot topics" by examining temporal dynamics and tagging abstracts to illustrate semantic content.
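The Markov chain Monte Carlo algorithm referred to here is a collapsed Gibbs sampler that resamples each word's topic assignment from its conditional distribution; the standard form of that update (our notation, not quoted from the paper) is

\[
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
\cdot
\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha},
\]

where the counts n exclude the current token, n^{(w_i)}_{-i,j} is how often word w_i is assigned to topic j, n^{(d_i)}_{-i,j} is how often topic j appears in document d_i, W is the vocabulary size, and T is the number of topics.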
Article
Full-text available
Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
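For reference, the latent class model and its EM fit can be written compactly (standard notation, ours): the model decomposes the joint document-word probability, and EM alternates between computing topic posteriors and re-estimating the factors.

\[
P(d, w) = P(d) \sum_{z} P(z \mid d)\, P(w \mid z),
\qquad
\text{E-step: } P(z \mid d, w) = \frac{P(z \mid d)\, P(w \mid z)}{\sum_{z'} P(z' \mid d)\, P(w \mid z')},
\]
\[
\text{M-step: } P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w),
\qquad
P(z \mid d) \propto \sum_{w} n(d, w)\, P(z \mid d, w),
\]

where n(d, w) is the number of occurrences of word w in document d.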
Article
Full-text available
We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be the result of a mixture of each author's topic mixture. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors and detection of unusual papers by specific authors. An online query interface to the model is also discussed that allows interactive exploration of author-topic models for corpora such as CiteSeer.
Article
Full-text available
We address the problem of learning topic hierarchies from data. The model selection problem in this domain is daunting: which of the large collection of possible trees should be used? We take a Bayesian approach, generating an appropriate prior via a distribution on partitions that we refer to as the nested Chinese restaurant process. This nonparametric prior allows arbitrarily large branching factors and readily accommodates growing data collections. We build a hierarchical topic model by combining this prior with a likelihood that is based on a hierarchical variant of latent Dirichlet allocation. We illustrate our approach on simulated data and with an application to the modeling of NIPS abstracts.
Conference Paper
This paper presents an LDA-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. Unlike other recent work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. Thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. We present results on nine months of personal email, 17 years of NIPS research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends.
Article
We make an analogy between images and statistical mechanics systems. Pixel gray levels and the presence and orientation of edges are viewed as states of atoms or molecules in a lattice-like physical system. The assignment of an energy function in the physical system determines its Gibbs distribution. Because of the Gibbs distribution, Markov random field (MRF) equivalence, this assignment also determines an MRF image model. The energy function is a more convenient and natural mechanism for embodying picture attributes than are the local characteristics of the MRF. For a range of degradation mechanisms, including blurring, nonlinear deformations, and multiplicative or additive noise, the posterior distribution is an MRF with a structure akin to the image model. By the analogy, the posterior distribution defines another (imaginary) physical system. Gradual temperature reduction in the physical system isolates low energy states ("annealing"), or what is the same thing, the most probable states under the Gibbs distribution. The analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations. The result is a highly parallel "relaxation" algorithm for MAP estimation. We establish convergence properties of the algorithm and we experiment with some simple pictures, for which good restorations are obtained at low signal-to-noise ratios.
Article
The generative aspect model is an extension of the multinomial model for text that allows word probabilities to vary stochastically across documents. Previous results with aspect models have been promising, but hindered by the computational difficulty of carrying out inference and learning. This paper demonstrates that the simple variational methods of Blei et al. (2001) can lead to inaccurate inferences and biased learning for the generative aspect model. We develop an alternative approach that leads to higher accuracy at comparable cost. An extension of Expectation-Propagation is used for inference and then embedded in an EM algorithm for learning. Experimental results are presented for both synthetic and real data sets.
Article
Summary: A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.
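In its general form, the algorithm alternates an expectation step that averages the complete-data log-likelihood over the latent variables and a maximization step that re-estimates the parameters, which yields the monotone likelihood behaviour referred to above (standard notation, ours):

\[
Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \sim P(\cdot \mid X,\, \theta^{(t)})}\!\left[ \log L(\theta; X, Z) \right],
\qquad
\theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)}),
\]

which guarantees \(\log L(\theta^{(t+1)}; X) \ge \log L(\theta^{(t)}; X)\).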
Conference Paper
In this work, we address the problem of joint modeling of text and citations in the topic modeling framework. We present two different models called the Pairwise-Link-LDA and the Link-PLSA-LDA models. The Pairwise-Link-LDA model combines the ideas of LDA [4] and Mixed Membership Block Stochastic Models [1] and allows modeling arbitrary link structure. However, the model is computationally expensive, since it involves modeling the presence or absence of a citation (link) between every pair of documents. The second model solves this problem by assuming that the link structure is a bipartite graph. As the name indicates, Link-PLSA-LDA model combines the LDA and PLSA models into a single graphical model. Our experiments on a subset of Citeseer data show that both these models are able to predict unseen data better than the baseline model of Erosheva and Lafferty [8], by capturing the notion of topical similarity between the contents of the cited and citing documents. Our experiments on two different data sets on the link prediction task show that the Link-PLSA-LDA model performs the best on the citation prediction task, while also remaining highly scalable. In addition, we also present some interesting visualizations generated by each of the models.
Conference Paper
The Indian buffet process (IBP) is an exchangeable distribution over binary matrices used in Bayesian nonparametric featural models. In this paper we propose a three-parameter generalization of the IBP exhibiting power-law behavior. We achieve this by generalizing the beta process (the de Finetti measure of the IBP) to the stable-beta process and deriving the IBP corresponding to it. We find interesting relationships between the stable-beta process and the Pitman-Yor process (another stochastic process used in Bayesian nonparametric models with interesting power-law properties). We derive a stick-breaking construction for the stable-beta process, and find that our power-law IBP is a good model for word occurrences in document corpora.
Conference Paper
Topic models, such as latent Dirichlet allocation (LDA), have been an effective tool for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about sports is more likely to also be about health than international finance. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution (1). We derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. The CTM gives a better fit than LDA on a collection of OCRed articles from the journal Science. Furthermore, the CTM provides a natural way of visualizing and exploring this and other unstructured data sets.
Conference Paper
We explore the utility of different types of topic models for retrieval purposes. Based on prior work, we describe several ways that topic models can be integrated into the retrieval process. We evaluate the effectiveness of different types of topic models within those retrieval approaches. We show that: (1) topic models are effective for document smoothing; (2) more rigorous topic models such as Latent Dirichlet Allocation provide gains over cluster-based models; (3) more elaborate topic models that capture topic dependencies provide no additional gains; (4) smoothing documents by using their similar documents is as effective as smoothing them by using topic models; (5) doing query expansion should utilize topics discovered in the top feedback documents instead of coarse-grained topics from the whole corpus; (6) generally, incorporating topics in the feedback documents for building relevance models can benefit the performance more for queries that have more relevant documents.
Conference Paper
Latent Dirichlet allocation (LDA) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. However, LDA does not capture correlations between topics. In this paper, we introduce the pachinko allocation model (PAM), which captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). The leaves of the DAG represent individual words in the vocabulary, while each interior node represents a correlation among its children, which may be words or other interior nodes (topics). PAM provides a …
Conference Paper
A natural evaluation metric for statistical topic models is the probability of held-out documents given a trained model. While exact computation of this probability is intractable, several estimators for this probability have been used in the topic modeling literature, including the harmonic mean method and empirical likelihood method. In this paper, we demonstrate experimentally that commonly-used methods are unlikely to accurately estimate the probability of held-out documents, and propose two alternative methods that are both accurate and efficient. In this paper we consider only the simplest topic model, latent Dirichlet allocation (LDA), and compare a number of methods for estimating the probability of held-out documents given a trained model. Most of the methods presented, however, are applicable to more complicated topic models. In addition to comparing evaluation methods that are currently used in the topic modeling literature, we propose several alternative methods. We present empirical results on synthetic and real-world data sets showing that the currently-used estimators are less accurate and have higher variance than the proposed new estimators.
Conference Paper
Search algorithms incorporating some form of topic model have a long history in information retrieval. For example, cluster-based retrieval has been studied since the 60s and has recently produced good results in the language model framework. An approach to building topic models based on a formal generative model of documents, Latent Dirichlet Allocation (LDA), is heavily cited in the machine learning literature, but its feasibility and effectiveness in information retrieval is mostly unknown. In this paper, we study how to efficiently use LDA to improve ad-hoc retrieval. We propose an LDA-based document model within the language modeling framework, and evaluate it on several TREC collections. Gibbs sampling is employed to conduct approximate inference in LDA and the computational complexity is analyzed. We show that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
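A common form of the LDA-based document model in this retrieval setting interpolates a Dirichlet-smoothed document language model with the word probability under the fitted LDA topics; the paper's exact parameterization may differ, but the general shape (our notation) is

\[
P(w \mid d) = \lambda\, P_{\mu}(w \mid d) + (1 - \lambda)\, P_{\mathrm{LDA}}(w \mid d),
\qquad
P_{\mathrm{LDA}}(w \mid d) = \sum_{k=1}^{K} P(w \mid z = k)\, P(z = k \mid d),
\]

where the topic distributions are estimated by Gibbs sampling and λ controls the weight given to the topic model.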
Conference Paper
In this paper, we propose a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus. In the latent semantic space derived by the non-negative matrix factorization (NMF), each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. The cluster membership of each document can be easily determined by finding the base topic (the axis) with which the document has the largest projection value. Our experimental evaluations show that the proposed document clustering method surpasses the latent semantic indexing and the spectral clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracies.
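A minimal sketch of this clustering-by-dominant-base-topic idea, assuming scikit-learn; the corpus and parameters are illustrative, and factorizing the document-by-term matrix here is equivalent, for cluster assignment, to factorizing its term-document transpose as in the paper.

    # NMF-based document clustering: the cluster of each document is the base
    # topic (axis) with the largest projection value (illustrative sketch).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = ["stock market trading", "market prices fall",
            "rain storm weather", "weather forecast rain"]
    X = TfidfVectorizer().fit_transform(docs)             # documents as rows

    nmf = NMF(n_components=2, init="nndsvd", random_state=0)
    W = nmf.fit_transform(X)                              # document coordinates over base topics
    labels = np.argmax(W, axis=1)                         # dominant base topic = cluster label
    print(labels)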
Conference Paper
In this paper, we propose a novel non-negative matrix factorization (NMF) of the affinity matrix for document clustering, which enforces non-negativity and orthogonality constraints simultaneously. With the orthogonality constraints, this NMF provides a solution to spectral clustering, inheriting the advantages of spectral clustering and offering a more reasonable clustering interpretation than previous NMF-based clustering methods. Furthermore, with the non-negativity constraints, the proposed method is also superior to traditional eigenvector-based spectral clustering, as it inherits the benefit of NMF-based methods that the non-negative solution is intuitive and the final clusters can be derived from it directly. As a result, the proposed method combines the advantages of spectral clustering and NMF-based methods, and hence outperforms both, as demonstrated by experimental results on the TDT2 and Reuters-21578 corpora.
Conference Paper
In this paper, we define the problem of topic-sentiment analysis on Weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. The proposed Topic-Sentiment Mixture (TSM) model can reveal the latent topical facets in a Weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. It could also provide general sentiment models that are applicable to any ad hoc topics. With a specifically designed HMM structure, the sentiment models and topic models estimated with TSM can be utilized to extract topic life cycles and sentiment dynamics. Empirical experiments on different Weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from Weblog collections. The TSM model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction.
Article
The language modeling approach to retrieval has been shown to perform well empirically. One advantage of this new approach is its statistical foundations. However, feedback, as one important component in a retrieval system, has only been dealt with heuristically in this new retrieval approach: the original query is usually literally expanded by adding additional terms to it. Such expansion-based feedback creates an inconsistent interpretation of the original and the expanded query. In this paper, we present a more principled approach to feedback in the language modeling approach. Specifically, we treat feedback as updating the query language model based on the extra evidence carried by the feedback documents. Such a model-based feedback strategy easily fits into an extension of the language modeling approach. We propose and evaluate two different approaches to updating a query language model based on feedback documents, one based on a generative probabilistic model of feedback documents and one based on minimization of the KL-divergence over feedback documents. Experiment results show that both approaches are effective and outperform the Rocchio feedback approach.
Article
Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. A core problem in language model estimation is smoothing , which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. In this paper, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections.
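Among the popular smoothing methods compared in this line of work, Dirichlet-prior smoothing is a representative example; in standard notation (ours), the smoothed document model is

\[
p_{\mu}(w \mid d) = \frac{c(w; d) + \mu\, p(w \mid \mathcal{C})}{|d| + \mu},
\]

where c(w; d) is the count of w in d, |d| is the document length, p(w | C) is the collection language model, and the parameter μ > 0 controls the amount of smoothing, with retrieval performance known to be sensitive to its setting.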
Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML '06 (pp. 577–584). ACM.
Mei, Q., Cai, D., Zhang, D., & Zhai, C. (2008). Topic modeling with network regularization. In WWW '08: Proceedings of the 17th International Conference on World Wide Web (pp. 101–110). New York, NY, USA: ACM.
Cai, D., Mei, Q., Han, J., & Zhai, C. (2008). Modeling hidden topics on document manifold. In J. G. Shanahan, S. Amer-Yahia, I. Manolescu, Y. Zhang, D. A. Evans, A. Kolcz, K.-S. Choi, & A. Chowdhury (Eds.), CIKM (pp. 911–920). ACM.
Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems. MIT Press.
Blei, D. M., Griffiths, T. L., Jordan, M. I., & Tenenbaum, J. B. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems. MIT Press.
Blei, D. M., & Lafferty, J. D. (2005). Correlated topic models. In NIPS. MIT Press.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.