
Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey

Springer Nature
Multimedia Tools and Applications
Multimedia Tools and Applications (2019) 78:15169–15211
https://doi.org/10.1007/s11042-018-6894-4
Latent Dirichlet allocation (LDA) and topic modeling: models,
applications, a survey
Hamed Jelodar1·Yongli Wang1,2 ·Chi Yuan1·Xia Feng1·Xiahui Jiang1·Yanchao Li1·
Liang Zhao1
Received: 5 June 2018 / Revised: 28 October 2018 / Accepted: 13 November 2018 /
Published online: 28 November 2018
©Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract
Topic modeling is one of the most powerful techniques in text mining for data mining, latent
data discovery, and finding relationships among data and text documents. Researchers have
published many articles on topic modeling and applied it in various fields such as software
engineering, political science, medicine, and linguistics. Among the various methods for topic
modeling, Latent Dirichlet Allocation (LDA) is one of the most popular, and researchers have
proposed many models based on it. Building on previous work, this paper serves as a useful
introduction to LDA approaches in topic modeling. We investigated highly cited scholarly
articles (from 2003 to 2016) related to LDA-based topic modeling to uncover the research
development, current trends, and intellectual structure of the field. In addition, we summarize
open challenges and introduce well-known tools and datasets for topic modeling based on LDA.
Keywords Topic modeling · Latent Dirichlet allocation · Tag recommendation ·
Semantic web · Gibbs sampling
Yongli Wang
YongliWang@njust.edu.cn
Hamed Jelodar
Jelodar@njust.edu.cn
Chi Yuan
yuanchi@njust.edu.cn
Xia Feng
779477284@qq.com
Xiahui Jiang
jxhchina@gmail.com
1School of Computer Science and Technology, Nanjing University of Science and Technology,
Nanjing 210094, China
2China Electronics Technology Cyber Security Co., Ltd, Chengdu, China
... Alternatively, probabilistic topic models are utilised in data mining and latent data discovery, as well as its most important implementation for discerning the relationship between data and text documents. Therefore, researchers have implemented the method to conduct aspect analyses [34,35]. For example, Latent Dirichlet Allocation (LDA) [36] is a popular unsupervised admixture or topic model, which specifically builds a set of topics based on an input collection of documents. ...
... For example, Latent Dirichlet Allocation (LDA) [36] is a popular unsupervised admixture or topic model, which specifically builds a set of topics based on an input collection of documents. As a parametric model, it incorporates the concept of Bag of Words (BoWs) for the articulation of topics; the standard LDA has been adopted by many researchers for aspect extraction [34,35,37,38]. Meanwhile, others have considered lexicon-based topic models [39][40][41], whereas works such as [42][43][44][45] have opted for distributed vector method and knowledge-based information as a supplementary approach for aspect analysis. ...
... Currently, the proposed topic models for aspect-analysis are mainly built on the premises of parametric model (i.e. LDA) as seen in different surveys such as [34,35,46], whereas a few others rely on its non-parametric counterpart (i.e. HDP) [47,48]. ...
Chapter
Full-text available
Aspect categorisation and its utmost importance in the field of Aspect-based Sentiment Analysis (ABSA) have encouraged researchers to improve topic model performance for modelling aspects into categories. In general, most current methods implement parametric models requiring a predetermined number of topics beforehand. However, this does not work efficiently for unannotated text data, which lack any class label. Therefore, the current work presents a novel non-parametric model that draws the number of topics from the semantic association between opinion targets (i.e., aspects) and their respectively expressed sentiments. The model incorporates Semantic Association Rules (SAR) into the Hierarchical Dirichlet Process (HDP), and is named 'SAR-HDP'. The phrase-based (or aspect-based) Bayesian model (SAR-HDP) does not assume that a word's sentence is drawn from a single topic, because a single review can contain multiple aspects belonging to multiple aspect topics (i.e., categories). Beyond considering semantic information for aspect identification, the proposed model further uses the semantic information discerned between the drawn topics and the identified aspects to maintain topic consistency. Empirical investigation showed that the proposed approach successfully outperformed standard parametric and non-parametric models in aspect categorisation on restaurant and hotel reviews sourced from Amazon and TripAdvisor.
... Latent Dirichlet Allocation (LDA) is a generative probabilistic model of a corpus, in which documents are represented as collections of latent topics, with word distributions that distinguish each topic [9]. LDA assumes that each document consists of a mixture of different topics, each of which can be described through a probability distribution. ...
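The generative story sketched in the excerpt above (each document mixes topics; each topic is a distribution over words) can be simulated directly. The sketch below samples one toy document under LDA's assumptions; the corpus sizes and the Dirichlet hyperparameter values are illustrative choices, not taken from any of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, chosen for illustration only.
n_topics, vocab_size, doc_len = 3, 8, 20
alpha = np.full(n_topics, 0.1)    # Dirichlet prior on per-document topic mixtures
beta = np.full(vocab_size, 0.01)  # Dirichlet prior on per-topic word distributions

# Draw a word distribution phi_k for each topic.
phi = rng.dirichlet(beta, size=n_topics)  # shape (n_topics, vocab_size)

# Generate one document: draw its topic mixture theta,
# then for each token draw a topic z and a word w ~ phi_z.
theta = rng.dirichlet(alpha)
topics = rng.choice(n_topics, size=doc_len, p=theta)
words = np.array([rng.choice(vocab_size, p=phi[z]) for z in topics])

print(words)  # a toy document as a sequence of word ids
```

Inference (e.g. Gibbs sampling or variational Bayes) runs this story in reverse, recovering `theta` and `phi` from observed words only.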
... It identifies hidden topics in a collection of documents by analysing word frequency and distribution. LDA assumes each document is a mixture of topics, and each topic is represented by a set of words with varying probabilities (Jelodar et al., 2018). Researchers can specify the number of topics (k) to generate and the number of keywords (n) associated with each topic. ...
Article
Full-text available
Purpose This study aims to use topic modeling to identify the different research dimensions of the e-library research domain. It analyses abstracts from a large set of scholarly articles to identify emerging topics and trends in sustainable e-library research and to guide future efforts, covering both heavily studied topics and those needing more research in the field of electronic libraries. Design/methodology/approach The study uses text mining and latent Dirichlet allocation (LDA) to analyse abstracts from 3,223 papers published in prestigious journals between 1961 and 2024; post-2023 articles are abundant in the data set. Identifying research themes and gaps involves data collection, text pre-processing, LDA topic modelling and visualization. Findings The LDA topic-modelling output reveals 15 topics, the main emphasis areas of "Sustainable E-Library" research. The findings reveal differences in research trends across the various elements of electronic libraries. While topics such as "e-library practices", "digital literacy in libraries", "medical record analysis", "mobile library platform", "academic content management", "technology impact studies", "semantic search approach", "virtual information environment", "digital knowledge rights", "digital research publishing" and "usability criteria analysis" have garnered significant interest, others such as "open collection architecture", "community support programs", "online information access" and "multimedia data processing" are less common. There has been a substantial increase in research publications on various subjects, particularly in the past ten years. Factors determining the popularity of articles in the e-library sector include keywords searched in Scopus, the number of citations and the publisher's name.
Research limitations/implications The insights gained from this study, along with the discussions of each issue, will guide researchers, academicians, journal editors and practitioners in directing their research efforts. Notably, no prior study has used an intelligent algorithm to identify research clusters within the field of sustainable e-libraries. Originality/value This study contributes to the field by applying LDA to identify research trends in sustainable e-libraries. By analysing a large data set spanning several decades, it provides valuable insights into prevalent research themes and areas requiring further investigation within the electronic library domain. The use of modern statistical methodologies enhances the study's originality, providing researchers and practitioners with new insights for future research and practice in the field.
... LDA [23], a generative probabilistic model, was used to uncover latent themes in the textual data. LDA assumes that each post is a mixture of topics, and each topic is a distribution over words. ...
Preprint
Full-text available
The bidirectional relationship between sleep disturbances and depression presents a serious challenge for digital mental health research and intervention. This study introduces SleepDepNet, a transformer-based multi-task learning model designed to assess sleep quality and depressive sentiment simultaneously from user-generated narratives on Reddit. Leveraging a large, custom-labelled dataset drawn from subreddits such as r/depression, r/sleep, r/mentalhealth, and r/insomnia, SleepDepNet integrates attention mechanisms, sentiment and emotion analysis, and topic modelling to capture linguistic markers of emotional exhaustion and disordered sleep. The model achieves strong performance (F1-scores of 0.89 for sleep quality and 0.86 for depressive sentiment), while its attention-based interpretability supports transparent clinical insight. The proposed SleepDepScore, a unified metric derived from both tasks, offers a scalable approach to digital risk stratification and mental health triage. These results demonstrate SleepDepNet's potential for real-world deployment in AI-driven mental health monitoring and personalized digital care.
... Topic modeling is widely acknowledged as a method for discovering the latent themes that characterize a collection of documents (for recent reviews with emphasis on applications, see Churchill and Singh 2021; Jelodar, Wang, Yuan, Feng, Jiang, Li and Zhao 2019). Mathematically, a topic is defined as a multinomial distribution over a collection of words that constitute a vocabulary. ...
Preprint
The Topics over Time (ToT) model captures thematic changes in timestamped datasets by explicitly modeling publication dates jointly with word co-occurrence patterns. However, ToT was not approached in a fully Bayesian fashion, a flaw that makes it susceptible to stability problems. To address this issue, we propose a fully Bayesian Topics over Time (BToT) model via the introduction of a conjugate prior to the Beta distribution. This prior acts as a regularization that prevents the online version of the algorithm from unstable updates when a topic is poorly represented in a mini-batch. The characteristics of this prior to the Beta distribution are studied here for the first time. Still, this model suffers from a difference in scale between the single-time observations and the multiplicity of words per document. A variation of BToT, Weighted Bayesian Topics over Time (WBToT), is proposed as a solution. In WBToT, publication dates are repeated a certain number of times per document, which balances the relative influence of words and timestamps along the inference process. We have tested our models on two datasets: a collection of over 200 years of US state-of-the-union (SOTU) addresses and a large-scale COVID-19 Twitter corpus of 10 million tweets. The results show that WBToT captures events better than Latent Dirichlet Allocation and other SOTA topic models like BERTopic: the median absolute deviation of the topic presence over time is reduced by 51% and 34%, respectively. Our experiments also demonstrate the superior coherence of WBToT over BToT, which highlights the importance of balancing the time and word modalities. Finally, we illustrate the stability of the online optimization algorithm in WBToT, which allows the application of WBToT to problems that are intractable for standard ToT.
... Topic identification helps determine content suitability, while latent themes evaluate inter-document correlations [4]. Latent Dirichlet Allocation (LDA) [5,6], a probabilistic framework for document collections, has proven effective for multi-document summarization [7][8][9][10][11]. ...
Article
Full-text available
Automatic Text Summarization (ATS) compacts source content into a concise format while preserving core information. While extensively studied for resource-rich languages, ATS remains challenging for low-resource languages like Kannada due to limited corpora and NLP tools. This work introduces an extractive, topic-driven method for summarizing Kannada news articles from multiple documents. We developed a custom dataset of 100 Kannada news story sets (3 articles per set) to address the lack of standardized benchmarks. The proposed approach leverages Latent Dirichlet Allocation (LDA) to identify latent themes across documents, followed by sentence selection using vector-space modeling. Sentences are scored based on their relevance to identified topics (via cosine similarity) and prioritized to maximize informational value while minimizing redundancy through Maximum Marginal Relevance (MMR). Evaluations using ROUGE metrics demonstrate that the LDA-based method outperforms existing summarization algorithms, producing summaries closer to human-generated references. The system achieves higher F-scores (e.g., 0.68 at 40% compression) compared to baseline models like TextRank and approaches for other Indian languages, validating its efficacy for low-resource linguistic contexts.
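The scoring-and-selection step the abstract describes (cosine relevance of each sentence to an identified topic, with redundancy penalized via Maximum Marginal Relevance) can be sketched generically. The vectors below are toy stand-ins for sentence representations in topic space, not the authors' Kannada pipeline:

```python
import numpy as np

def mmr_select(sent_vecs, topic_vec, k=2, lam=0.7):
    """Greedy Maximal Marginal Relevance: trade off relevance to the topic
    against redundancy with sentences already selected."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    selected, remaining = [], list(range(len(sent_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            rel = cos(sent_vecs[i], topic_vec)  # relevance to the topic
            red = max((cos(sent_vecs[i], sent_vecs[j]) for j in selected),
                      default=0.0)              # similarity to picks so far
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy vectors: sentences 0 and 1 are near-duplicates; sentence 2 is distinct.
sents = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
topic = np.array([1.0, 0.2])
print(mmr_select(sents, topic, k=2))  # → [1, 0]
```

The parameter `lam` controls the relevance/diversity trade-off: `lam=1` reduces to pure topic-relevance ranking, while smaller values push the summary toward covering different content.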
... We then imported the .bib file into the R environment to automate and facilitate the analysis. Topic modelling (see Jelodar et al., 2019) was employed to identify the top terms in the literature. A number of preprocessing steps in R were carried out to clean and normalise the textual data. ...
Article
Full-text available
The wider application of extended reality (XR) in various industrial settings has created numerous opportunities for enhancing worker safety. Several XR solutions have been applied to address specific safety challenges faced by workers. This study reviewed the current literature (2017–2024) on how XR technologies can potentially enhance worker safety. The PRISMA protocol was used to highlight how XR technologies are utilized in safety training for high-risk industries, their limitations, and recommendations for future improvements. Findings from a review of 41 studies indicate diverse opportunities (e.g., improved knowledge and productivity, delivery of interactive and sequential instructions) for virtual reality (VR), augmented reality (AR), and mixed reality (MR) in industries such as mining, construction, manufacturing, healthcare, power distribution/thermal plants, aviation, and firefighting. Several challenges (e.g., limited viewing fields, motion sickness, and control issues) were identified in the use of VR, AR, and MR, stemming from both human and socio-technical factors. The overall sentiment towards the use of XR in safety training was predominantly positive (550 instances), reflecting confidence in these technologies to enhance safety training outcomes. Findings from this study offer new insights into the capabilities of XR technologies in improving worker safety in high-risk industries and outline key considerations for policymakers and technology providers when integrating XR technologies to promote worker safety.
Article
Full-text available
ABSTRACT Writing an undergraduate thesis (skripsi) is an academic obligation that students must fulfil as a graduation requirement in higher education. The Dakwah Management study programs at PTKIN institutions across Java exhibit a diversity of research themes that merits further analysis. This study aims to identify topic trends in student thesis titles using the Latent Dirichlet Allocation (LDA) topic modeling method. The data come from PTKIN digital repositories, covering 1,648 thesis titles from 2021 to 2023. Sampling was carried out through web scraping with the BeautifulSoup library, producing a collection of titles for further analysis. The LDA model identified five main topics, namely: the management of dakwah institutions, dakwah communication strategies, the role of mosques and pesantren, and the use of technology in dakwah. Model validation using the coherence score yielded 0.5107, indicating good modeling quality. The results of this study can serve as a reference for students in choosing relevant thesis themes and help academic institutions understand research developments in the field of Dakwah Management. However, the limited data coverage indicates the need for follow-up studies extending the analysis to more educational institutions. Keywords: Latent Dirichlet Allocation, Dakwah Management, Topic Modeling, Thesis, Research Trends I. INTRODUCTION Writing a thesis is one of the academic obligations that students must complete as a graduation requirement. In the context of the Dakwah Management study program at State Islamic Religious Higher Education institutions (PTKIN), the thesis plays an important role as a reflection of scholarly development and research trends within the program. In addition, the thesis serves as evidence of students' ability to analyse and offer solutions to problems in dakwah.
Dakwah Management as a discipline focuses on the management of dakwah activities, covering strategy, communication, and the technology that supports successful dakwah in society. However, although the Dakwah Management study programs at State Islamic Religious Higher Education institutions (PTKIN) have a specific scholarly concentration, students often face challenges in determining research themes relevant to both scholarly needs and practice in society. This may be caused by a lack of guidance on, or mapping of, previously studied research themes. At PTKIN institutions, especially in Java, which have a large number of students and institutions, mapping research themes is becoming increasingly urgent in order to avoid duplication and to ensure contribution
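The study above validates its LDA model with a coherence score (0.5107, presumably computed by a standard topic-modeling toolkit). As a self-contained illustration of what such a score measures, here is a minimal UMass-style coherence computation; the variant, the toy corpus, and the word pair are assumptions for demonstration, not the study's actual setup:

```python
import math

def umass_coherence(top_words, docs):
    """UMass topic coherence for one topic's top words: sums
    log((D(w_m, w_l) + 1) / D(w_l)) over word pairs, where D counts
    documents containing the word(s). Values closer to 0 indicate that
    the topic's words co-occur more often in the corpus."""
    doc_sets = [set(d.split()) for d in docs]
    def d(*ws):
        return sum(all(w in s for w in ws) for s in doc_sets)
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((d(top_words[m], top_words[l]) + 1) / d(top_words[l]))
    return score

# Toy corpus of three "titles" (hypothetical, for illustration).
docs = ["dakwah strategi komunikasi", "masjid pesantren dakwah", "teknologi dakwah media"]
print(umass_coherence(["dakwah", "strategi"], docs))
```

Toolkit metrics such as c_v use sliding-window co-occurrence and normalized pointwise mutual information instead, so their absolute values (e.g. 0.5107) live on a different scale than UMass scores.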
Article
Full-text available
Socio-economic maps contain important information regarding the population of a country. Computing these maps is critical given that policy makers often make important decisions based upon such information. However, the compilation of socio-economic maps requires extensive resources and is highly expensive. On the other hand, the ubiquitous presence of cell phones is generating large amounts of spatiotemporal data that can reveal human behavioral traits related to specific socio-economic characteristics. Traditional inference approaches have taken advantage of these datasets to infer regional socio-economic characteristics. In this paper, we propose a novel approach whereby topic models are used to infer socio-economic levels from large-scale spatio-temporal data. Instead of using a pre-determined set of features, we use Latent Dirichlet Allocation (LDA) to extract latent recurring patterns of co-occurring behaviors across regions, which are then used in the prediction of socio-economic levels. We show that our approach improves state-of-the-art prediction results by 9%.
Article
Full-text available
Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.
Article
In this paper we aim to model the relationship between the text of a political blog post and the comment volume (the total amount of response) that the post will receive. We seek to accurately identify which posts will attract a high-volume response, and also to gain insight about the community of readers and their interests. We design and evaluate variations on a latent-variable topic model that links text to comment volume.
Article
In this paper, we propose a generative model to automatically discover the hidden associations between topics words and opinion words. By applying those discovered hidden associations, we construct the opinion scoring models to extract statements which best express opinionists’ standpoints on certain topics. For experiments, we apply our model to the political area. First, we visualize the similarities and dissimilarities between Republican and Democratic senators with respect to various topics. Second, we compare the performance of the opinion scoring models with 14 kinds of methods to find the best ones. We find that sentences extracted by our opinion scoring models can effectively express opinionists’ standpoints.
Article
With the development of Web 2.0, sentiment analysis has now become a popular research problem to tackle. Recently, topic models have been introduced for the simultaneous analysis of topics and sentiment in a document. These studies, which jointly model topic and sentiment, take advantage of the relationship between topics and sentiment, and are shown to be superior to traditional sentiment analysis tools. However, most of them make the assumption that, given the parameters, the sentiments of the words in the document are all independent. In our observation, in contrast, sentiments are expressed in a coherent way. The local conjunctive words, such as "and" or "but", are often indicative of sentiment transitions. In this paper, we propose a major departure from the previous approaches by making two linked contributions. First, we assume that the sentiments are related to the topic in the document, and put forward a joint sentiment and topic model, i.e. Sentiment-LDA. Second, we observe that sentiments are dependent on local context. Thus, we further extend the Sentiment-LDA model to the Dependency-Sentiment-LDA model by relaxing the sentiment-independence assumption in Sentiment-LDA. The sentiments of words are viewed as a Markov chain in Dependency-Sentiment-LDA. Through experiments, we show that exploiting the sentiment dependency is clearly advantageous, and that Dependency-Sentiment-LDA is an effective approach for sentiment analysis.
Article
This paper presents the Topic-Aspect Model (TAM), a Bayesian mixture model which jointly discovers topics and aspects. We broadly define an aspect of a document as a characteristic that spans the document, such as an underlying theme or perspective. Unlike previous models which cluster words by topic or aspect, our model can generate token assignments in both of these dimensions, rather than assuming words come from only one of two orthogonal models. We present two applications of the model. First, we model a corpus of computational linguistics abstracts, and find that the scientific topics identified in the data tend to include both a computational aspect and a linguistic aspect. For example, the computational aspect of GRAMMAR emphasizes parsing, whereas the linguistic aspect focuses on formal languages. Secondly, we show that the model can capture different viewpoints on a variety of topics in a corpus of editorials about the Israeli-Palestinian conflict. We show both qualitative and quantitative improvements in TAM over two other state-of-the-art topic models.
Article
People go to fortune tellers in hopes of learning things about their future. A future career path is one of the topics most frequently discussed. But rather than rely on "black arts" to make predictions, in this work we scientifically and systematically study the feasibility of career path prediction from social network data. In particular, we seamlessly fuse information from multiple social networks to comprehensively describe a user and characterize progressive properties of his or her career path. This is accomplished via a multi-source learning framework with fused lasso penalty, which jointly regularizes the source and career-stage relatedness. Extensive experiments on real-world data confirm the accuracy of our model.
Article
Adverse drug reactions (ADRs) are a major burden for patients and the healthcare industry. They often cause preventable hospitalizations and deaths, and are associated with enormous costs. Traditional preclinical in vitro safety profiling and clinical safety trials are restricted in terms of small scale, long duration, huge financial costs and limited statistical significance. The availability of large amounts of drug and ADR data potentially allows ADR predictions during a drug's early preclinical stage, using data analytics methods to inform more targeted clinical safety tests. Despite their initial success, existing methods have trade-offs among interpretability, predictive power and efficiency. This urges us to explore methods that could have all these strengths and provide practical solutions for real-world ADR predictions. We cast the ADR-drug relation structure into a three-layer hierarchical Bayesian model. We interpret each ADR as a symbolic word and apply latent Dirichlet allocation (LDA) to learn topics that may represent certain biochemical mechanisms relating ADRs to drug structures. Based on LDA, we designed an equivalent regularization term to incorporate the hierarchical ADR domain knowledge. Finally, we developed a mixed-input model leveraging a fast collapsed Gibbs sampling method in which the complexity of each sampling iteration is proportional only to the number of positive ADRs. Experiments on real-world data show our models achieved higher prediction accuracy and shorter running time than the state-of-the-art alternatives.
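The abstract above builds on a fast collapsed Gibbs sampler for LDA. The sketch below is a generic collapsed Gibbs sampler for standard LDA on a toy integer-encoded corpus, not the authors' mixed-input model or their sparsity optimization; it shows the core update in which each token's topic is resampled from counts that exclude that token:

```python
import numpy as np

def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_iter=200, seed=0):
    """Minimal collapsed Gibbs sampler for standard LDA. Documents are
    lists of integer word ids; returns the topic-word count matrix."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # document-topic counts
    nkw = np.zeros((n_topics, vocab_size))  # topic-word counts
    nk = np.zeros(n_topics)                 # tokens per topic
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):          # initialize counts randomly
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                 # remove the token's current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Collapsed conditional: p(z=k | rest) up to a constant.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                 # record the resampled topic
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return nkw

# Two tiny "documents" over a 4-word vocabulary: words 0-1 and 2-3 co-occur.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 2]]
nkw = collapsed_gibbs_lda(docs, n_topics=2, vocab_size=4)
print(nkw)
```

The speedup the authors describe comes from restructuring this conditional so that only the sparse, positive-count terms are visited per iteration; the naive version above touches every topic for every token.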
Article
Topic models remain a black box both for modelers and for end users in many respects. From the modelers' perspective, many decisions must be made which lack clear rationales and whose interactions are unclear — for example, how many topics the algorithms should find (K), which words to ignore (aka the "stop list"), and whether it is adequate to run the modeling process once or multiple times, producing different results due to the algorithms that approximate the Bayesian priors. Furthermore, the results of different parameter settings are hard to analyze, summarize, and visualize, making model comparison difficult. From the end users' perspective, it is hard to understand why the models perform as they do, and information-theoretic similarity measures do not fully align with humanistic interpretation of the topics. We present the Topic Explorer, which advances the state-of-the-art in topic model visualization for document-document and topic-document relations. It brings topic models to life in a way that fosters deep understanding of both corpus and models, allowing users to generate interpretive hypotheses and to suggest further experiments. Such tools are an essential step toward assessing whether topic modeling is a suitable technique for AI and cognitive modeling applications.
Article
In this work, we investigate the problem of learning knowledge from the massive community-contributed images with rich weakly-supervised context information, which can benefit multiple image understanding tasks simultaneously, such as social image tag refinement and assignment, content-based image retrieval, tag-based image retrieval and tag expansion. Towards this end, we propose a Deep Collaborative Embedding (DCE) model to uncover a unified latent space for images and tags. The proposed method incorporates the end-to-end learning and collaborative factor analysis in one unified framework for the optimal compatibility of representation learning and latent space discovery. A nonnegative and discrete refined tagging matrix is learned to guide the end-to-end learning. To collaboratively explore the rich context information of social images, the proposed method integrates the weakly-supervised image-tag correlation, image correlation and tag correlation simultaneously and seamlessly. The proposed model is also extended to embed new tags in the uncovered space. To verify the effectiveness of the proposed method, extensive experiments are conducted on two widely-used social image benchmarks for multiple social image understanding tasks. The encouraging performance of the proposed method over the state-of-the-art approaches demonstrates its superiority.