Conference Paper

Online Learning for Latent Dirichlet Allocation

Authors: Matthew D. Hoffman, David M. Blei, Francis Bach

Abstract

We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good as or better than those found with batch VB, and in a fraction of the time.
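For readers who want a concrete picture of the update described in the abstract, below is a minimal NumPy sketch of one online variational Bayes step, assuming K topics, a vocabulary of size W, scalar Dirichlet hyperparameters alpha and eta, and the step-size schedule rho_t = (tau0 + t)^(-kappa); the function names, defaults, and fixed number of local iterations are illustrative choices, not the authors' reference implementation.

import numpy as np
from scipy.special import digamma

def local_e_step(word_ids, word_cts, lam, alpha, n_iter=50):
    # Fit the per-document variational parameter gamma for one document,
    # given the current topic parameters lam (K x W), and return gamma
    # together with the document's contribution to the sufficient statistics.
    K = lam.shape[0]
    Elogbeta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    expElogbeta_d = np.exp(Elogbeta[:, word_ids])            # K x N_d
    gamma = np.ones(K)
    for _ in range(n_iter):
        expElogtheta = np.exp(digamma(gamma) - digamma(gamma.sum()))
        phinorm = expElogtheta @ expElogbeta_d + 1e-100      # length N_d
        gamma = alpha + expElogtheta * (expElogbeta_d @ (word_cts / phinorm))
    expElogtheta = np.exp(digamma(gamma) - digamma(gamma.sum()))
    phinorm = expElogtheta @ expElogbeta_d + 1e-100
    sstats = np.zeros_like(lam)
    sstats[:, word_ids] = np.outer(expElogtheta, word_cts / phinorm) * expElogbeta_d
    return gamma, sstats

def online_vb_step(minibatch, lam, alpha, eta, D, t, tau0=1024.0, kappa=0.7):
    # One noisy natural-gradient step on the topic parameters lam, using a
    # minibatch of (word_ids, word_cts) NumPy-array pairs drawn from a
    # corpus of D documents at iteration t.
    S = len(minibatch)
    sstats = np.zeros_like(lam)
    for word_ids, word_cts in minibatch:
        _, s = local_e_step(word_ids, word_cts, lam, alpha)
        sstats += s
    lam_hat = eta + (D / S) * sstats          # estimate of the full-batch update
    rho = (tau0 + t) ** (-kappa)              # Robbins-Monro step size
    return (1.0 - rho) * lam + rho * lam_hat

Streaming documents through online_vb_step once, minibatch by minibatch, corresponds to the single-pass setting used in the 3.3M-article Wikipedia experiment.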
... Latent Dirichlet allocation (LDA) [32,33] is based on a bag of words (BoW) model. This approach, which has a long history in NLP research [34], represents each text in the dataset as the count of the unique words in that document. ...
... But, with enough documents and enough words, statistical regularities can emerge; that is, certain documents may use similar words, putting them at similar "locations" in the high-dimensional "word space." Topic modeling techniques like LDA allow one to use these regularities to extract latent topics, in the form of sets of words that frequently occur together [32,33]. Each document can then be scored based on the amount (prevalence) of these topics in its text, which, in turn, allows one to look for trends in topic prevalence across the dataset. ...
... Exploratory modeling showed that with increasing α, the mean and peak of the primary topic distribution for the embeddings model shifted toward 1 (as is shown in Fig. 10), and significantly more articles were classified as nearly 100% of a single topic. Furthermore, some differences are to be expected, as the two models are attending to different features of the text and arriving at their topic scores using very different mathematical models: LDA is attending to vocabulary use in the texts, and essentially performing a kind of probabilistic matrix factorization in order to extract a small number of "basis vectors" for this vocabulary [33]. Embeddings, in contrast, focus on the meanings of the texts (operationalized as aggregate relationships between the words) and provide topic scores based on text similarities in a high-dimensional "meaning space." ...
Article
Full-text available
We propose a technique for performing deductive qualitative data analysis at scale on text-based data. Using a natural language processing technique known as text embeddings, we create vector-based representations of texts in a high-dimensional meaning space within which it is possible to quantify differences in meaning as vector distances. To apply the technique, we build off prior work that used topic modeling via latent Dirichlet allocation to thematically analyze 18 years of the Physics Education Research Conference Proceedings literature. We first extend this analysis through 2023. Next, we create embeddings of all texts and, using representative articles from the 10 topics found by the LDA analysis, define centroids in the meaning space. We calculate the distances between every article and centroid and use the inverted, scaled distances between these centroids and articles to create an alternate topic model. We benchmark this model against the LDA model results and show that this embeddings model recovers most of the trends from that analysis. Finally, to illustrate the versatility of the method, we define eight new topic centroids derived from a review of the physics education research literature by Docktor and Mestre and reanalyze the literature using these researcher-defined topics. Based on these analyses, we critically discuss the features, uses, and limitations of this method and argue that it holds promise for flexible deductive qualitative analysis of a wide variety of text-based data that avoids many of the drawbacks inherent to prior NLP methods. Published by the American Physical Society 2024
... We compute the posteriors over z and θ by performing the first step in our algorithm using the test data and the perturbed sufficient statistics we obtain during training. We adapted the Python implementation by the authors of (Hoffman et al., 2010) for our experiments. Figure 2 shows the trade-off between privacy and per-word perplexity on the Wikipedia dataset for the different methods under a variety of conditions, in which we varied the value of σ ∈ {1.0, 1.1, 1.24, 1.5, 2} and the minibatch size S ∈ {5,000, 10,000, 20,000}. ...
... In our case, p_model is the posterior predictive distribution under our variational approximation to the posterior, which is intractable to compute. Following Hoffman et al. (2010), we approximate perplexity based on the learned variational distribution, measured in nats, by plugging the ELBO into Equation 25, which results in an upper bound: ...
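For orientation, the bound referred to in this excerpt has the standard form used to evaluate variational topic models: because the ELBO lower-bounds the log evidence of the held-out documents, substituting it into the per-word perplexity yields an upper bound,

\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\!\Big(-\frac{\sum_d \log p(\mathbf{w}_d)}{\sum_d n_d}\Big) \;\le\; \exp\!\Big(-\frac{\sum_d \mathrm{ELBO}(\mathbf{w}_d)}{\sum_d n_d}\Big),

where n_d is the number of tokens in document d ("Equation 25" is the citing preprint's own numbering).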
Preprint
Many applications of Bayesian data analysis involve sensitive information, motivating methods which ensure that privacy is protected. We introduce a general privacy-preserving framework for Variational Bayes (VB), a widely used optimization-based Bayesian inference method. Our framework respects differential privacy, the gold-standard privacy criterion, and encompasses a large class of probabilistic models, called the Conjugate Exponential (CE) family. We observe that we can straightforwardly privatise VB's approximate posterior distributions for models in the CE family, by perturbing the expected sufficient statistics of the complete-data likelihood. For a broadly-used class of non-CE models, those with binomial likelihoods, we show how to bring such models into the CE family, such that inferences in the modified model resemble the private variational Bayes algorithm as closely as possible, using the Polya-Gamma data augmentation scheme. The iterative nature of variational Bayes presents a further challenge since iterations increase the amount of noise needed. We overcome this by combining: (1) an improved composition method for differential privacy, called the moments accountant, which provides a tight bound on the privacy cost of multiple VB iterations and thus significantly decreases the amount of additive noise; and (2) the privacy amplification effect of subsampling mini-batches from large-scale data in stochastic learning. We empirically demonstrate the effectiveness of our method in CE and non-CE models including latent Dirichlet allocation, Bayesian logistic regression, and sigmoid belief networks, evaluated on real-world datasets.
... Each document has a separate collection of terminology, and each topic has its own vocabulary. The objective of LDA is to figure out what subject a document belongs to based on the words it contains [7]. ...
Conference Paper
Full-text available
The Latent Dirichlet Allocation (LDA) topic model is employed for the purpose of categorizing textual content within a document into specific topics. The objective of this study is to construct a model that identifies themes related to students’ perceptions and attitudes towards the deployment of online classes. This was achieved by analyzing data collected from students at CSPC who are now experiencing the effects of the epidemic. The participants of the study were restricted to those who were currently enrolled as students in Polytechnic Colleges located in the province of Camarines Sur. Data collection was conducted through the utilization of questionnaires, which were distributed to the students via Google Forms. The present study employed a qualitative research design. The efficacy of the LDA algorithm in topic or theme identification was observed by the researchers. The utilization of coherence score and perplexity proved to be a dependable approach for testing the model. The visual depiction of the themes effectively highlights the significant keywords associated with each issue and yields favourable outcomes that accurately portray the extracted topics about the sentiments of the pupils. The selection of the topic name and the subsequent interpretation effectively elucidated the significance of each respective issue.
... LDA is a probabilistic approach which is widely used in the literature (e.g. [42,49,92]). NMF offers computational efficiency and a matrix-based perspective that is fast and effective for large datasets [99]. ...
Preprint
Full-text available
Social media constitutes a rich and influential source of information for qualitative researchers. Although computational techniques like topic modelling assist with managing the volume and diversity of social media content, qualitative researchers' lack of programming expertise creates a significant barrier to their adoption. In this paper we explore how BERTopic, an advanced Large Language Model (LLM)-based topic modelling technique, can support qualitative data analysis of social media. We conducted interviews and hands-on evaluations in which qualitative researchers compared topics from three modelling techniques: LDA, NMF, and BERTopic. BERTopic was favoured by 8 of 12 participants for its ability to provide detailed, coherent clusters for deeper understanding and actionable insights. Participants also prioritised topic relevance, logical organisation, and the capacity to reveal unexpected relationships within the data. Our findings underscore the potential of LLM-based techniques for supporting qualitative analysis.
... The authors in [7], [61] propose to incorporate temporal information, and [6] extends LDA by assuming correlations among topics. To make it possible to process large datasets, [27] extends LDA from the batch mode to the online setting. On recommender systems, [60] extends LDA to incorporate rating information and make recommendations. ...
Preprint
While perception tasks such as visual object recognition and text understanding play an important role in human intelligence, the subsequent tasks that involve inference, reasoning and planning require an even higher level of intelligence. The past few years have seen major advances in many perception tasks using deep learning models. For higher-level inference, however, probabilistic graphical models with their Bayesian nature are still more powerful and flexible. To achieve integrated intelligence that involves both perception and inference, it is naturally desirable to tightly integrate deep learning and Bayesian models within a principled probabilistic framework, which we call Bayesian deep learning. In this unified framework, the perception of text or images using deep learning can boost the performance of higher-level inference and in return, the feedback from the inference process is able to enhance the perception of text or images. This paper proposes a general framework for Bayesian deep learning and reviews its recent applications on recommender systems, topic models, and control. In this paper, we also discuss the relationship and differences between Bayesian deep learning and other related topics like Bayesian treatment of neural networks.
... More specifically, we focus on a big data setting in which the globally shared simplex-constrained model parameters could be linked to some latent counts via the multinomial likelihood. When there are tens of thousands or millions of observations in the dataset, scalable Bayesian inference for the simplex-constrained globally shared model parameters is highly desired, for example, for inferring the topics' distributions over words in latent Dirichlet allocation (Blei et al. 2003; Hoffman et al. 2010) and Poisson factor analysis (Zhou et al. 2012, 2016). ...
Preprint
We introduce a fast and easy-to-implement simulation algorithm for a multivariate normal distribution truncated on the intersection of a set of hyperplanes, and further generalize it to efficiently simulate random variables from a multivariate normal distribution whose covariance (precision) matrix can be decomposed as a positive-definite matrix minus (plus) a low-rank symmetric matrix. Example results illustrate the correctness and efficiency of the proposed simulation algorithms.
Article
Full-text available
For climate models to continue improving, we need to uncover as many discrepancies they have with reality as possible. In particular, evaluating the representation of extreme events is important but challenging owing to their rarity. Here, we study how general circulation models reproduce large-scale atmospheric circulation associated with extreme temperature events. To this end, we apply Latent Dirichlet Allocation (LDA), a dimensionality reduction method, to a set of sea-level pressure ERA5 maps over the north-Atlantic region. LDA provides a basis of sparse latent modes called “motifs” that consist of localized objects at synoptic scale. Any pressure map can be approximated by a generally sparse combination of motifs, whose coefficients are called the weights, containing local information about large-scale circulation. Weights statistics can be used to locally characterize circulation patterns, in general and during extreme events, allowing for detailed comparison of datasets. For four CMIP6 models and reanalysis, we quantify local circulation errors and identify model-agnostic and model-specific biases. On average, large-scale circulation is well predicted by all models, but model errors are increased for heatwaves and cold spells. Significant errors were found to be associated with Mediterranean motifs for all models in all cases. In addition, the combination of motif and temperature error can discriminate between models in the general and cold spell cases, while models perform similarly on heatwaves. The sparse characterization provided by LDA analysis is therefore well suited for model preselection for the study of extreme events.
Article
Full-text available
X (formerly known as Twitter), Reddit, and other social media forums have dramatically changed the way society interacts with live events in this day and age. The huge amount of data generated by these platforms presents challenges, especially in terms of processing speed and the complexity of finding meaningful patterns and events. These data streams are generated in multiple formats, with constant updating, and are real-time in nature; thus, they require sophisticated algorithms capable of dynamic event detection in this dynamic environment. Event detection techniques have recently achieved substantial development, but most research carried out so far evaluates only single methods, not comparing the overall performance of these methods across multiple platforms and types of data. With that view, this paper represents a deep investigation of complex state-of-the-art event detection algorithms specifically customized for streams of data from X. We review various current techniques based on a thorough comparative performance test and point to problems inherently related to the detection of patterns in high-velocity streams with noise. We introduce some novelty to this research area, supported by appropriate robust experimental frameworks, to perform comparisons quantitatively and qualitatively. We provide insight into how those algorithms perform under varying conditions by defining a set of clear, measurable metrics. Our findings contribute new knowledge that will help inform future research into the improvement of event detection systems for dynamic data streams and enhance their capabilities for real-time and actionable insights. This paper will go a step further than the present knowledge of event detection and discuss how algorithms can be adapted and refined in view of the emerging demands imposed by data streams.
Article
Microblogging platforms have been increasingly used by the public in crisis situations, enabling more participatory crisis communication between the official response channels and the affected community. However, the sheer volume of crisis-related messages on social media can make it challenging for officials to find pertinent information and understand the public's perception of evolving risks. To address this issue, crisis informatics researchers have proposed a variety of technological solutions, but there has been limited examination of the cognitive and perceptual processes and subsequent responses of the affected population. Yet, this information is critical for crisis response officials to gauge the public's understanding of the event, their perception of event-related risk, and their perception of incident response and recovery efforts, in turn enabling the officials to craft crisis communication messaging more effectively. Taking cues from the Protective Action Decision Model, we conceptualize a metric, resonance+, that prioritizes the cognitive and perceptual processes of the affected population, quantifying shifts in collective attention and information exposure for each tweet. Based on resonance+, we develop a principled, scalable pipeline that recommends content relating to people's cognitive and perceptual processes. Our results suggest that resonance+ is generalizable across different types of natural hazards. We have also demonstrated its applicability for near-real-time scenarios. According to feedback from the target users, the local public information officers (PIOs) in emergency management, the messages recommended by our pipeline are useful in their tasks of understanding public perception and finding hopeful narratives, potentially leading to more effective crisis communications.
Article
Full-text available
This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks and Markov random fields). We present a number of examples of graphical models, including the QMR-DT database, the sigmoid belief network, the Boltzmann machine, and several variants of hidden Markov models, in which it is infeasible to run exact inference algorithms. We then introduce variational methods, which exploit laws of large numbers to transform the original graphical model into a simplified graphical model in which inference is efficient. Inference in the simplified model provides bounds on probabilities of interest in the original model. We describe a general framework for generating variational transformations based on convex duality. Finally we return to the examples and demonstrate how variational algorithms can be formulated in each case.
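As a one-line illustration of the bounds discussed in this tutorial, the generic variational lower bound obtained from Jensen's inequality is

\log p(x) \;\ge\; \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)],

which holds for any tractable distribution q over the latent variables z and is tight when q equals the true posterior p(z \mid x).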
Article
Full-text available
Variational inference methods, including mean field methods and loopy belief propagation, have been widely used for approximate probabilistic inference in graphical models.
Conference Paper
Full-text available
In this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. A personal profile, called CommunityNet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. It can be used for personal social capital management. Clusters of CommunityNets provide a view of informal networks for organization management. Our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. We tested CommunityNets on the Enron Email corpus and report experimental results including filtering, prediction, and recommendation capabilities. We show that the personal behavior and intention are somewhat predictable based on these models. For instance, "to whom a person is going to send a specific email" can be predicted by one's personal social network and content analysis. Experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on Latent Dirichlet Allocation with social network enhancement. Two online demo systems we developed that allow interactive exploration of CommunityNet are also discussed.
Article
The convergence of online learning algorithms is analyzed using the tools of the stochastic approximation theory, and proved under very weak conditions. A general framework for online learning algorithms is first presented. This framework encompasses the most common online learning algorithms in use today, as illustrated by several examples. The stochastic approximation theory then provides general results describing the convergence of all these learning algorithms at once.
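The general framework alluded to here writes each algorithm as a stochastic update of the parameters w_t driven by a randomly drawn example z_t,

w_{t+1} = w_t - \gamma_t H(z_t, w_t), \qquad \gamma_t > 0,\; \sum_t \gamma_t = \infty,\; \sum_t \gamma_t^2 < \infty,

where the expectation of H over z points along the gradient of the cost being minimized; the step-size schedule \rho_t = (\tau_0 + t)^{-\kappa} with \kappa \in (0.5, 1] used by online LDA satisfies these conditions.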
Article
Let M(x) denote the expected value at level x of the response to a certain experiment. M(x) is assumed to be a monotone function of x but is unknown to the experimenter, and it is desired to find the solution $x = \theta$ of the equation $M(x) = \alpha$, where $\alpha$ is a given constant. We give a method for making successive experiments at levels $x_1, x_2, \ldots$ in such a way that $x_n$ will tend to $\theta$ in probability.
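Stated concretely (writing y_n for the response observed at level x_n, and assuming M is nondecreasing), the recursion analyzed in this work is

x_{n+1} = x_n + a_n(\alpha - y_n), \qquad a_n > 0,\; \sum_n a_n = \infty,\; \sum_n a_n^2 < \infty,

and it is this Robbins-Monro step-size condition that the learning-rate schedules of online learning algorithms, including online LDA, are designed to satisfy.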
Article
Latent Dirichlet analysis, or topic modeling, is a flexible latent variable framework for modeling high-dimensional sparse count data. Various learning algorithms have been developed in recent years, including collapsed Gibbs sampling, variational inference, and maximum a posteriori estimation, and this variety motivates the need for careful empirical comparisons. In this paper, we highlight the close connections between these approaches. We find that the main differences are attributable to the amount of smoothing applied to the counts. When the hyperparameters are optimized, the differences in performance among the algorithms diminish significantly. The ability of these algorithms to achieve solutions of comparable accuracy gives us the freedom to select computationally efficient approaches. Using the insights gained from this comparative study, we show how accurate topic models can be learned in several seconds on text corpora with thousands of documents.
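The "amount of smoothing" observation can be made concrete. Written schematically (not in the paper's exact notation), the algorithms form their topic-proportion estimates from the same expected counts N_{kj} but with different implicit offsets:

\hat{\theta}_{kj}^{\mathrm{MAP}} \propto N_{kj} + \alpha - 1, \qquad \hat{\theta}_{kj}^{\mathrm{CGS/CVB0}} \propto N_{kj} + \alpha, \qquad \hat{\theta}_{kj}^{\mathrm{VB}} \propto \exp\big(\psi(N_{kj} + \alpha)\big),

so optimizing the hyperparameters can absorb much of the difference between these count corrections, consistent with the convergence in performance reported above.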
Article
A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.
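For reference, the two steps of the algorithm presented here can be written compactly as

\text{E-step: } Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \sim p(Z \mid X, \theta^{(t)})}\big[\log p(X, Z \mid \theta)\big], \qquad \text{M-step: } \theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)}),

and the monotonicity result referred to above is that the observed-data likelihood p(X \mid \theta^{(t)}) never decreases from one iteration to the next.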
Conference Paper
Topic models provide a powerful tool for analyzing large text collections by representing high dimensional data in a low dimensional subspace. Fitting a topic model given a set of training documents requires approximate inference techniques that are computationally expensive. With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model. In this paper, we empirically evaluate the performance of several methods for topic inference in previously unseen documents, including methods based on Gibbs sampling, variational inference, and a new method inspired by text classification. The classification-based inference method produces results similar to iterative inference methods, but requires only a single matrix multiplication. In addition to these inference methods, we present SparseLDA, an algorithm and data structure for evaluating Gibbs sampling distributions. Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.
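As a schematic of the data-structure idea (assuming symmetric hyperparameters \alpha and \beta, a vocabulary of size V, and the usual count notation n_{k|d}, n_{w|k}, n_k, which are this summary's assumptions rather than the paper's exact notation), SparseLDA splits the unnormalized Gibbs sampling mass for topic k into three buckets:

p(z = k \mid \cdot) \;\propto\; \frac{(\alpha + n_{k|d})(\beta + n_{w|k})}{\beta V + n_k} \;=\; \frac{\alpha\beta}{\beta V + n_k} + \frac{n_{k|d}\,\beta}{\beta V + n_k} + \frac{(\alpha + n_{k|d})\,n_{w|k}}{\beta V + n_k},

where the first bucket is shared by all tokens, the second is nonzero only for topics present in the current document, and the third only for topics that currently contain the word w, which is what allows the sampler to touch only the sparse counts.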