Figure 5
The document view, looking at a speech by Hillary Clinton in the context of a topic about health care (the rest of the document is cut off in this screenshot, and the colored tokens are unfortunately not very visible in black and white).
Source publication
Topic models have been shown to reveal the semantic content in large corpora. Many individualized visualizations of topic models have been reported in the literature, showing the potential of topic models to give valuable insight into a corpus. However, good, general tools for browsing the entire output of a topic model along with the analyzed co...
Context in source publication
... simply browsing through the documents, we provide sorting and filtering methods on the list of documents, as mentioned previously. When looking at a particular document, we show basic information about the document, its text, the topic distribution in the document, and similar documents based on that distribution. When looking at the document in the context of a topic, we also highlight the tokens in the document that were labeled with that topic. For example, the user might be curious to see a document that best demonstrates the health care topic mentioned above. Clicking on the top document in the document list (as in the bottom of Figure 3) brings the user to the view shown in Figure 5. The document visualizations reported here were drawn from the work of others [1], and in fact the view in Figure 5 constitutes the entirety of most previous corpus browsers based on topic models (e.g., [2]). Our browser goes substantially beyond the existing functionality reported by others.

We also provide views of individual words in the corpus. When viewing a word in the context of a topic, the user can see all uses of the word in that topic in the corpus, with context taken from the corresponding documents. The user can also view words independently with a search-like interface, seeing the topics and documents in which the word appears most frequently. The user examining the health care topic might be curious where else the word "health" was used in the topics and in the corpus. Figure 6 shows the result of using our word search to answer that question. While it provides basic functionality, the search interface leaves much to be desired, as only single words can currently be searched for; we plan to expand it to phrases.

The user can also look at aggregated information for the values of an attribute (e.g., a particular candidate or party), combining the topic and word counts for all documents with the given attribute. This view is also somewhat limited at present, showing only which topics and words are used most frequently by the collection of documents with that attribute. Figure 7 shows us, among other things, that one of Barack Obama's top topics was "change in politics."

We currently include two kinds of plots in our topic browser, with plans to implement many more. The first shows trends for topics over the values of an attribute (such as date or candidate), useful for corpus browsing. This kind of plot has been used in visualizing topic models almost since ...
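The mechanics behind this document view are straightforward to prototype. Below is a minimal Python sketch of its two core operations: highlighting the tokens labeled with a given topic (as in Figure 5) and ranking similar documents by their topic distributions. The toy data, the function names, and the choice of Hellinger distance are our own illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Toy per-token topic assignments: each document is a list of
# (token, topic_id) pairs, as produced by a Gibbs-sampled LDA run.
docs = {
    "speech_clinton": [("health", 3), ("care", 3), ("families", 1), ("plan", 3)],
    "speech_obama":   [("change", 1), ("politics", 1), ("health", 3), ("hope", 1)],
    "speech_mccain":  [("economy", 2), ("taxes", 2), ("plan", 3), ("reform", 2)],
}
NUM_TOPICS = 4

def topic_distribution(doc):
    """Normalized topic counts over a document's tokens."""
    counts = np.zeros(NUM_TOPICS)
    for _, topic in doc:
        counts[topic] += 1
    return counts / counts.sum()

def highlight(doc, topic):
    """Mark tokens labeled with `topic`, mimicking the colored tokens in Figure 5."""
    return " ".join(f"[{tok}]" if t == topic else tok for tok, t in doc)

def similar_documents(name, docs):
    """Rank other documents by Hellinger distance between topic distributions."""
    p = topic_distribution(docs[name])
    dists = {}
    for other, doc in docs.items():
        if other == name:
            continue
        q = topic_distribution(doc)
        dists[other] = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    return sorted(dists.items(), key=lambda kv: kv[1])

print(highlight(docs["speech_clinton"], topic=3))
print(similar_documents("speech_clinton", docs))
```

Any divergence between probability vectors would serve equally well here; Hellinger distance is simply one common, symmetric choice for comparing topic mixtures.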
Similar publications
Topic models are in widespread use in natural language processing and beyond. Here, we propose a new framework for the evaluation of probabilistic topic modeling algorithms based on synthetic corpora containing an unambiguously defined ground truth topic structure. The major innovation of our approach is the ability to quantify the agreement betwee...
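The generative side of such a framework can be sketched directly from the standard LDA process: draw topic-word and document-topic distributions from Dirichlet priors, then sample a topic and a word for each token. The sketch below is a generic illustration with made-up dimensions; the cited paper's actual synthetic-corpus construction and its agreement measure are not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth model: K topics over a V-word vocabulary, drawn from
# Dirichlet priors (the standard LDA generative process).
K, V, n_docs, doc_len = 3, 50, 100, 80
topic_word = rng.dirichlet(np.full(V, 0.1), size=K)      # true phi
doc_topic = rng.dirichlet(np.full(K, 0.5), size=n_docs)  # true theta

corpus, true_assignments = [], []
for d in range(n_docs):
    z = rng.choice(K, size=doc_len, p=doc_topic[d])            # topic per token
    w = np.array([rng.choice(V, p=topic_word[k]) for k in z])  # word per token
    corpus.append(w)
    true_assignments.append(z)

# A fitted model's token-level topic assignments can now be scored
# against `true_assignments`, e.g. with normalized mutual information.
```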
Citations
... from the Latent Dirichlet Allocation algorithm [55]. LDAvis has two panels (left and right) (Fig. 2). ...
This paper extends an initial investigation of eHealth from the developers' perspective. In this extension, our focus is on mobile health data. Despite the significant potential of this development area, few studies try to understand the challenges faced by these professionals. This perspective is relevant for identifying the most-used technologies and future directions for research. Using a KDD-based process, this work analyzed eHealth and mHealth discussions from Stack Overflow (SO) to understand this developer community. We retrieved and processed 6082 eHealth and 1832 mHealth questions. The most discussed topics include manipulating medical images, electronic health records with the HL7 standard, and frameworks to support mobile health (mHealth) development. Concerning the challenges faced by these developers, there is a lack of understanding of the DICOM and HL7 standards, an absence of data repositories for testing, and difficulty monitoring health data in the background using mobile and wearable devices. Our results also indicate that discussions have grown mainly around mHealth, primarily due to monitoring health data through wearables and optimizing resource consumption during health monitoring.
... Effective visualization offers a tool for analysts to make inferences about the data through the lens of a model abstraction (Chuang, Ramage, et al., 2012; Fortuna et al., 2005). Particularly due to the wide adoption of big data in different fields of study, there is a general agreement that visualization can support and enhance the interpretability of results (Gardner et al., 2010; Chaney & Blei, 2012; Gretarsson et al., 2011; Sievert & Shirley, 2014). In this study, we visualize LDA results by using a multidimensional scaling method to extend the analysis of topic modeling and illustrate the results of the analysis for increased interpretability. ...
Cities are critical sites for climate action. Population and infrastructure are concentrated in urban areas, and their susceptibility to climate change impacts makes them a pivotal place to embark on adaptation plans and strategies. In the Fifth Assessment Report (AR5), the Intergovernmental Panel on Climate Change (IPCC) affirms that urban adaptation allows sustainable development and resilience. However, without evidence, this affirmation fails to acquire credibility and objectivity. In an attempt to provide evidence for the assertion, this study examines current actions in urban centers to determine whether there is an alignment between adaptation and development. The study employs text mining techniques to analyze 400 urban project descriptions from Cities100 reports (2015–2019) of the C40 network. With Latent Dirichlet Allocation (LDA), a machine learning algorithm for topic model analysis, the study identifies 17 major topics. Using multidimensional scaling and cluster analysis to further characterize the findings, it finds an alignment of adaptation with sustainable and resilient urban development in several major cities. In this way, the paper contributes to a global understanding of urban adaptation and demonstrates a way of incorporating grey literature into urban adaptation studies.
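The multidimensional scaling step mentioned in both passages above can be sketched as follows: compute pairwise distances between topic-word distributions (Jensen-Shannon distance is a common choice, used by LDAvis) and project the topics into two dimensions. The stand-in topic-word matrix here is random; a real analysis would use the rows of the fitted model.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
topic_word = rng.dirichlet(np.full(200, 0.1), size=10)  # stand-in topic-word rows

# Pairwise Jensen-Shannon distances between topic-word distributions.
K = topic_word.shape[0]
D = np.zeros((K, K))
for i in range(K):
    for j in range(i + 1, K):
        D[i, j] = D[j, i] = jensenshannon(topic_word[i], topic_word[j])

# Project topics to 2-D, as in LDAvis's inter-topic distance map.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
print(coords[:3])
```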
... Topics have to be interpreted by the analysts, and it is widely recognized that this interpretation can be very hard [26][100]. To this aim, analysts usually consider different approaches based on statistical methods [19][86][121][18] or exploit proper visualizations [53][25][115][33][32]. One of the most recent and effective solutions is LDAvis. ...
Phishing is the fraudulent attempt to obtain sensitive information by disguising oneself as a trustworthy entity in digital communication. It is a type of cyber attack that is often successful because users are not aware of their vulnerabilities or are unable to understand the risks.
This article presents a Systematic Literature Review (SLR) conducted to draw a "big picture" of the most important research works performed on human factors and phishing. The analysis of the retrieved publications, framed along the research questions addressed in the SLR, helps in understanding how human factors should be considered to defend against phishing attacks. Future research directions are also highlighted.
... We experimentally proved that the proposed metrics outperform the state-of-the-art ones. We believe that these metrics should be considered in topic modeling visualization tools [11,12,22,15,23] to improve their performance and allow a user to obtain relevant results. As future work, different word embedding methods could be investigated, also considering the word embeddings deriving from state-of-the-art contextualized language models, e.g. ...
Topic models aim at discovering a set of hidden themes in a text corpus. A user might be interested in identifying the topics most similar to a given theme of interest. To accomplish this task, several similarity and distance metrics can be adopted. In this paper, we provide a comparison of the state-of-the-art topic similarity measures and propose novel metrics based on word embeddings. The proposed measures can overcome some limitations of the existing approaches, showing good capabilities in terms of several topic performance measures on benchmark datasets.
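One plausible embedding-based topic similarity of the kind proposed here is the cosine similarity between probability-weighted centroids of each topic's top-word embeddings. The sketch below uses random stand-in embeddings and made-up topics; the paper's actual metrics may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy embeddings; in practice these would come from word2vec, GloVe, etc.
vocab = ["health", "care", "insurance", "doctor", "tax", "economy", "budget"]
emb = {w: rng.normal(size=50) for w in vocab}

def topic_vector(top_words, weights):
    """Probability-weighted centroid of a topic's top-word embeddings."""
    vecs = np.array([emb[w] for w in top_words])
    weights = np.asarray(weights) / np.sum(weights)
    return weights @ vecs

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

t1 = topic_vector(["health", "care", "insurance"], [0.5, 0.3, 0.2])
t2 = topic_vector(["tax", "economy", "budget"], [0.4, 0.4, 0.2])
print(cosine(t1, t2))
```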
... To explore the relevance of the AI thematic subdomains and their relationships, we use LDAvis, a system for visualising and interpreting topics estimated by LDA (Sievert & Shirley, 2014). This visualisation system allows each topic to be explored separately, with a comparison of each term's frequency in a topic and in the corpus, rather than only providing insights about the corpus in the form of word clouds per topic or bar plots per document (Chaney & Blei, 2012; Gardner et al., 2010; Snyder, Knowles, Dredze, Gormley, & Wolfe, 2013). In Fig. 4, the result of this visualisation is adapted to illustrate both the topics resulting from the topic model and the titles of the corresponding AI subdomains over the entire study period in the AI industrial and R&D activities. The topics' areas are proportional to the topics' prevalence in the corpus. ...
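The topic-versus-corpus term comparison described above is governed by Sievert and Shirley's relevance measure, relevance(w, t | λ) = λ log p(w|t) + (1 − λ) log [p(w|t) / p(w)]. A minimal sketch follows, using a random stand-in topic-word matrix and a simplified marginal p(w) (LDAvis weights the marginal by topic prevalence, which we skip here):

```python
import numpy as np

def relevance(topic_word, corpus_word, lam=0.6):
    """Sievert & Shirley (2014) relevance: lam*log p(w|t) + (1-lam)*log lift."""
    return lam * np.log(topic_word) + (1 - lam) * np.log(topic_word / corpus_word)

rng = np.random.default_rng(0)
phi = rng.dirichlet(np.full(30, 0.1), size=5)  # stand-in topic-word matrix
p_w = phi.mean(axis=0)                         # simplified marginal word frequencies

# Top terms for topic 0 under the relevance ranking, as in LDAvis's right panel.
top = np.argsort(relevance(phi[0], p_w))[::-1][:10]
print(top)
```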
Artificial intelligence (AI) is playing a major role in the new paradigm shift occurring across the technological landscape. After a series of alternating seasons starting in the 1960s, AI is now experiencing a new spring. Nevertheless, although it is spreading throughout our economies and societies in multiple ways, the absence of standardised classifications prevents us from obtaining a measure of its pervasiveness. In addition, AI cannot be identified as part of a specific sector, but rather as a transversal technology, because the fields in which it is applied do not have precise boundaries. In this work, we address the need for a deeper understanding of this complex phenomenon by investigating economic agents' involvement in industrial activities aimed to supply AI-related goods and services, and AI-related R&D processes in the form of patents and publications. In order to conduct this extensive analysis, we use a complex systems approach through the agent-artifact space model, which identifies the core dimensions that should be considered. Therefore, by considering the geographic location of the involved agents and their organisation types (i.e., firms, governmental institutions, and research institutes), we (i) provide an overview of the worldwide presence of agents, (ii) investigate the patterns in which AI technological subdomains subsist and scatter in different parts of the system, and (iii) reveal the size, composition, and topology of the AI R&D collaboration network. Based on a unique data collection of multiple micro-based data sources and supported by a methodological framework for the analysis of techno-economic segments (TES), we capture the state of AI in the worldwide landscape in the period 2009–2018. As expected, we find that major roles are played by the US, China, and the EU28. Nevertheless, by measuring the system, we unveil elements that provide new, crucial information to support more conscious discussions in the process of policy design and implementation.
... Several statistical algorithms have been applied to model topics in scientific literature [1][2][3][4][5]. As such methods require considerable mathematical and programming background, recent research proposes user-friendly integrated tools to enable researchers of various backgrounds to explore topic analysis [6][7][8]. However, currently available tools do not cover the entire topic and trend analysis workflow and require custom setup. ...
Topic modeling refers to a suite of probabilistic algorithms for extracting word patterns from a collection of documents, aiming at data clustering and the detection of research trends. We developed an online service that implements different variations of the Latent Dirichlet Allocation (LDA) algorithm. Scientific literature originating from targeted search queries in PubMed serves as input, while output files are available for every step of the process. Researchers can compare the results of different corpora, text preprocessing choices, and topic modeling parameters in a quick and organized way. Information regarding topics helps users assign labels and group them into categories. Data visualization is a contribution of our service, with graphs generated on the fly providing information about the corpora, the topics, groups of topics, and categories. We rely on modern technologies and follow the principles of agile software development to achieve scalability and a discreet design.
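The core step of such a pipeline, from raw abstracts to a fitted LDA model and its top topic terms, can be sketched with scikit-learn. The stand-in abstracts below replace a real PubMed query; the cited service's own implementation is not shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in abstracts; a real pipeline would fetch them from a PubMed query.
abstracts = [
    "gene expression profiling in tumor cells",
    "clinical trial of a novel cancer therapy",
    "machine learning for medical image analysis",
    "deep learning models segment medical images",
]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

terms = vec.get_feature_names_out()
for k, row in enumerate(lda.components_):
    top = row.argsort()[::-1][:5]
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```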
... Topical Guide (Gardner et al., 2010), Topic Viz (Eisenstein et al., 2012), and the Topic Model Visualization Engine (Chaney and Blei, 2012) are tools that support corpus understanding and directed browsing through topic models. They display the model overview as an aggregate of underlying topic visualizations. ...
... In practice, topic word lists have many variations. They can be represented horizontally (Gardner et al., 2010;Smith et al., 2015) or vertically (Eisenstein et al., 2012;Chaney and Blei, 2012), with or without commas separating the individual words, or using set notation (Chaney and Blei, 2012). Nguyen et al. (2013) add the weights to the word list by sizing the words based on their probability for the topic, which blurs the boundary with word clouds; however, this approach is not common. ...
Probabilistic topic models are important tools for indexing, summarizing, and analyzing large document collections by their themes. However, promoting end-user understanding of topics remains an open research problem. We compare labels generated by users given four topic visualization techniques—word lists, word lists with bars, word clouds, and network graphs—against each other and against automatically generated labels. Our basis of comparison is participant ratings of how well labels describe documents from the topic. Our study has two phases: a labeling phase where participants label visualized topics and a validation phase where different participants select which labels best describe the topics' documents. Although all visualizations produce similar quality labels, simple visualizations such as word lists allow participants to quickly understand topics, while complex visualizations take longer but expose multi-word expressions that simpler visualizations obscure. Automatic labels lag behind user-created labels, but our dataset of manually labeled topics highlights linguistic patterns (e.g., hypernyms, phrases) that can be used to improve automatic topic labeling algorithms.
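The word-list variants described above are easy to illustrate. The sketch below renders a toy topic horizontally with commas, vertically, and with a crude text analogue of probability-sized words (uppercasing in place of font scaling); the formatting choices are ours, not the cited papers'.

```python
def horizontal(words, sep=", "):
    """Comma-separated word list, as in Gardner et al. (2010)."""
    return sep.join(w for w, _ in words)

def vertical(words):
    """One word per line, as in Chaney and Blei (2012)."""
    return "\n".join(w for w, _ in words)

def weighted(words):
    """Crude text analogue of probability-sized words (Nguyen et al., 2013):
    emphasize words whose topic probability is large."""
    max_p = max(p for _, p in words)
    return " ".join(w.upper() if p > 0.5 * max_p else w for w, p in words)

topic = [("health", 0.21), ("care", 0.15), ("insurance", 0.09), ("plan", 0.04)]
print(horizontal(topic))
print(vertical(topic))
print(weighted(topic))
```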
... Second, the data can be transformed into both a full-text corpus (by pasting together the tokens) and a DTM (by dropping the token positions). This also enables the results of some text analysis techniques to be visualized in the text, such as coloring words based on a word scale model (Slapin & Proksch, 2008) or producing browsers for topic models (Gardner et al., 2010). Third, each token can be annotated with token-specific information, such as that obtained from advanced NLP techniques. ...
Computational text analysis has become an exciting research field with many applications in communication research. It can be a difficult method to apply, however, because it requires knowledge of various techniques, and the software required to perform most of these techniques is not readily available in common statistical software packages. In this teacher’s corner, we address these barriers by providing an overview of general steps and operations in a computational text analysis project, and demonstrate how each step can be performed using the R statistical software. As a popular open-source platform, R has an extensive user community that develops and maintains a wide range of text analysis packages. We show that these packages make it easy to perform advanced text analytics.
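The dual representation described in the context above, the same token list serving as a full-text corpus and as a document-term matrix, can be sketched in a few lines. We use Python here rather than the article's R, and the token triples are toy data.

```python
from collections import Counter

# Token-level representation: (doc_id, position, token) triples.
tokens = [
    ("d1", 0, "health"), ("d1", 1, "care"), ("d1", 2, "reform"),
    ("d2", 0, "tax"), ("d2", 1, "reform"), ("d2", 2, "tax"),
]

# Full-text corpus: paste tokens back together in position order.
full_text = {}
for doc, pos, tok in sorted(tokens):
    full_text.setdefault(doc, []).append(tok)
full_text = {d: " ".join(ws) for d, ws in full_text.items()}

# Document-term matrix: drop positions and count.
dtm = {}
for doc, _, tok in tokens:
    dtm.setdefault(doc, Counter())[tok] += 1

print(full_text["d2"])  # "tax reform tax"
print(dtm["d2"])        # Counter({'tax': 2, 'reform': 1})
```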
... Topic models do not automatically provide meaning; they must be manually interpreted and evaluated by domain experts [7]. The Topic Browser [22], Termite [12], LDAvis [50] and LDAExplore [21] focused on verifying model quality through visual comparisons of how well topics relate to each other and how well terms associate with each topic. Our work most closely relates to visual analysis tools that support the manual inspection and verification of the relevance and meaningfulness of latent topics. ...
PhenoLines is a visual analysis tool for the interpretation of disease subtypes, derived from the application of topic models to clinical data. Topic models enable one to mine cross-sectional patient comorbidity data (e.g., electronic health records) and construct disease subtypes, each with its own temporally evolving prevalence and co-occurrence of phenotypes, without requiring aligned longitudinal phenotype data for all patients. However, the dimensionality of topic models makes interpretation challenging, and de facto analyses provide little intuition regarding phenotype relevance or phenotype interrelationships. PhenoLines enables one to compare phenotype prevalence within and across disease subtype topics, thus supporting subtype characterization, a task that involves identifying a proposed subtype's dominant phenotypes, ages of effect, and clinical validity. We contribute a data transformation workflow that employs the Human Phenotype Ontology to hierarchically organize phenotypes and aggregate the evolving probabilities produced by topic models. We introduce a novel measure of phenotype relevance that can be used to simplify the resulting topology. The design of PhenoLines was motivated by formative interviews with machine learning and clinical experts. We describe the collaborative design process, distill high-level tasks, and report on initial evaluations with machine learning experts and a medical domain expert. These results suggest that PhenoLines demonstrates promising approaches to support the characterization and optimization of topic models.
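One piece of the workflow described above, propagating topic-model phenotype probabilities up an ontology hierarchy, can be sketched as follows. The toy ontology and probabilities are our own, and PhenoLines' actual transformation and relevance measure are more involved.

```python
# Toy ontology: child -> parent, loosely in the spirit of PhenoLines'
# use of the Human Phenotype Ontology (exact workflow not shown here).
parent = {
    "Hypertension": "Cardiovascular abnormality",
    "Arrhythmia": "Cardiovascular abnormality",
    "Cardiovascular abnormality": "Phenotypic abnormality",
}

# Topic-model probabilities for leaf phenotypes within one subtype topic.
leaf_prob = {"Hypertension": 0.12, "Arrhythmia": 0.07}

# Propagate each leaf's probability up to every ancestor.
agg = dict(leaf_prob)
for term, p in leaf_prob.items():
    node = term
    while node in parent:
        node = parent[node]
        agg[node] = agg.get(node, 0.0) + p

print(agg)  # ancestors accumulate the mass of their descendants
```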
... These features help group and identify the documents and facilitate ease of use for the end user. To visualise the topic model and present it in a user-friendly manner to the end user, several visualisation systems for topic modelling have been built [4,5,6,8]. However, these systems focused on browsing the model and demonstrating the inter-connections amongst the documents, topics, and words. ...
Successful cybersecurity depends on the processing of vast quantities of data from a diverse range of sources such as police reports, blogs, intelligence reports, security bulletins, and news sources. This results in large volumes of unstructured text data that are difficult to manage or investigate manually. In this paper we introduce a tool that summarises, categorises, and models such data sets, along with a search engine to query the model produced from the data. The search engine can be used to find links, similarities, and differences between documents in a way that goes beyond current search approaches. The tool is based on the probabilistic topic modelling technique, which goes further than the lexical analysis of documents to model the subtle relationships between words, documents, and abstract topics. It will assist researchers in querying the underlying models latent in the documents and tapping into the repository of documents, allowing them to be ordered thematically.
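The core of such a topic-model-backed search engine can be sketched by inferring a query's topic mixture and ranking documents by similarity in topic space rather than by shared keywords. Everything below (the corpus, the model size, the ranking function) is a toy stand-in, not the tool's implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in corpus of security reports.
reports = [
    "phishing email campaign targets banking credentials",
    "ransomware encrypts hospital records and demands payment",
    "botnet traffic observed against government web servers",
    "spear phishing messages impersonate the finance department",
]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(reports)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
doc_topics = lda.transform(dtm)

def search(query):
    """Rank documents by cosine similarity in topic space, not by keywords."""
    q = lda.transform(vec.transform([query]))[0]
    sims = doc_topics @ q / (
        np.linalg.norm(doc_topics, axis=1) * np.linalg.norm(q) + 1e-12)
    return [reports[i] for i in np.argsort(sims)[::-1]]

print(search("credential theft via email")[0])
```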