Book

Integrating Artificial Intelligence and Visualization for Visual Knowledge Discovery

Abstract

This book is devoted to the emerging field of integrated visual knowledge discovery that combines advances in artificial intelligence/machine learning and visualization/visual analytics. A long-standing challenge of artificial intelligence (AI) and machine learning (ML) is explaining models to humans, especially for life-critical applications like health care. Model explanation is fundamentally a human activity, not only an algorithmic one. As current deep learning studies demonstrate, this makes the paradigm based on visual methods critically important for addressing this challenge. In general, visual approaches are critical for discovering explainable high-dimensional patterns in all types of high-dimensional data, offering "n-D glasses," where preserving high-dimensional data properties and relations in visualizations is a major challenge. The current progress opens a fantastic opportunity in this domain. This book is a collection of 25 extended works of over 70 scholars presented at AI and visual analytics related symposia at the recent International Information Visualization Conferences with the goal of moving this integration to the next level. The sections of this book cover integrated systems, supervised learning, unsupervised learning, optimization, and evaluation of visualizations. The intended audience for this collection includes those developing and using emerging AI/machine learning and visualization methods. Scientists, practitioners, and students can find multiple examples of the current integration of AI/machine learning and visualization for visual knowledge discovery. The book provides a vision of future directions in this domain. New researchers will find here an inspiration to join the profession and to be involved in its further development. Instructors in AI/ML and visualization classes can use it as a supplementary source in their undergraduate and graduate classes.
Conference Paper
Transformer networks have revolutionized NLP representation learning since they were introduced. Though a great effort has been made to explain the representation in transformers, it is widely recognized that our understanding is not sufficient. One important reason is the lack of visualization tools for detailed analysis. In this paper, we propose to use dictionary learning to open up these 'black boxes' as linear superpositions of transformer factors. Through visualization, we demonstrate the hierarchical semantic structures captured by the transformer factors, e.g., word-level polysemy disambiguation, sentence-level pattern formation, and long-range dependency. While some of these patterns confirm conventional prior linguistic knowledge, the rest are relatively unexpected, which may provide new insights. We hope this visualization tool can bring further knowledge and a better understanding of how transformer networks work.
Conference Paper
In this paper, we study the visual design of hierarchical multivariate data analysis. We focus on the extension of four hierarchical univariate concepts (the sunburst chart, the icicle plot, the circular treemap, and the bubble treemap) to the multivariate domain. Our study identifies several advantageous design variants, which we discuss with respect to previous approaches, and whose utility we evaluate with a user study and demonstrate for different analysis purposes and different types of data.
Article
Cyber-physical systems in smart factories are becoming more and more integrated and interconnected. Industry 4.0 accelerates this trend even further. This broad interconnectivity gives rise to a new class of faults, the contextual faults, where contextual knowledge is needed to find the underlying cause. Fully automated systems and the production line in a smart factory form a complex environment, making fault diagnosis non-trivial. Along with a dataset, we give a first definition of contextual faults in the smart factory and name initial use cases. Additionally, the dataset encompasses all the data recorded in a current state-of-the-art smart factory. We also add information measured by our developed sensing units to enrich the smart factory data even further. Finally, we show a first approach to detecting contextual faults in a manual preliminary analysis of the recorded log data.
Article
Road traffic emissions are considered a major contributor to urban air pollution, but clean air actions have led to a huge reduction in emissions per vehicle. This raises a pressing question on the potential to further reduce road traffic emissions to improve air quality. Here, we analysed ~11 million real-world data to estimate the contribution of road traffic to roadside and urban concentrations for several major cities. Our results confirm that road traffic remains a dominant source of nitrogen dioxide and a significant source of primary coarse particulate matter in the European cities. However, it now represents a relatively small component of overall PM2.5 at urban background locations in cities with strong controls on traffic emissions (including European cities and Beijing) and many roadside sites will exceed the WHO guideline (10 μg m⁻³ annual mean) even when this source is eliminated. This suggests that further controls on traffic emissions, including the transition to a battery-electric fleet, are needed to reduce NO2 concentrations, but this will have limited benefit to reduce the concentration of fine particles, except in countries where the use of diesel particle filters is not mandatory. There are substantial differences between cities and the optimal solution will differ from one to another.
Article
Visual analytics for machine learning has recently evolved as one of the most exciting areas in the field of visualization. To better identify which research topics are promising and to learn how to apply relevant techniques in visual analytics, we systematically review 259 papers published in the last ten years together with representative works before 2010. We build a taxonomy, which includes three first-level categories: techniques before model building, techniques during model building, and techniques after model building. Each category is further characterized by representative analysis tasks, and each task is exemplified by a set of recent influential works. We also discuss and highlight research challenges and promising future research opportunities useful for visual analytics researchers.
Article
Scatterplots and scatterplot matrix methods have been popularly used for showing statistical graphics and for exposing patterns in multivariate data. A recent technique, called Linkable Scatterplots, offers an interesting idea for interactive visual exploration: it provides a set of necessary plot panels on demand together with interaction, linking and brushing. This article presents a controlled study with a mixed-model design to evaluate the effectiveness and user experience of visual exploration when using Sequential-Scatterplots, where a single plot is shown at a time, Multiple-Scatterplots, where the number of plots can be specified and shown, and Simultaneous-Scatterplots, where all plots are shown as a scatterplot matrix. Results from the study demonstrated higher accuracy using the Multiple-Scatterplots visualization, particularly in comparison with the Simultaneous-Scatterplots. While the time taken to complete tasks was longer with the Multiple-Scatterplots technique than with the simpler Sequential-Scatterplots, Multiple-Scatterplots is inherently more accurate. Moreover, the Multiple-Scatterplots technique is the most highly preferred and positively experienced technique in this study. Overall, results support the strength of Multiple-Scatterplots and highlight its potential as an effective data visualization technique for exploring multivariate data.
Article
Matrix visualizations are a useful tool to provide a general overview of a graph's structure. For multivariate graphs, a remaining challenge is to cope with the attributes that are associated with nodes and edges. Addressing this challenge, we propose responsive matrix cells as a focus+context approach for embedding additional interactive views into a matrix. Responsive matrix cells are local zoomable regions of interest that provide auxiliary data exploration and editing facilities for multivariate graphs. They behave responsively by adapting their visual contents to the cell location, the available display space, and the user task. Responsive matrix cells enable users to reveal details about the graph, compare node and edge attributes, and edit data values directly in a matrix without resorting to external views or tools. We report the general design considerations for responsive matrix cells, covering the visual and interactive means necessary to support seamless data exploration and editing. Responsive matrix cells have been implemented in a web-based prototype, with which we demonstrate the utility of our approach. We describe a walk-through for the use case of analyzing a graph of soccer players and report insights from a preliminary user feedback session.
Preprint
Mass cytometry, also known as CyTOF, is a newly developed technology for quantification and classification of immune cells that can allow for analysis of over three dozen protein markers per cell. The high-dimensional data that is generated requires innovative methods for analysis and visualization. We conducted a comparative analysis of four dimension reduction techniques (principal component analysis (PCA), isometric feature mapping (Isomap), t-distributed stochastic neighbor embedding (t-SNE), and Diffusion Maps) by implementing them on benchmark mass cytometry data sets. We compare the results of these reductions using computation time, residual variance, a newly developed comparison metric we term neighborhood proportion error (NPE), and two-dimensional visualizations. We find that t-SNE and Diffusion Maps are the two most effective methods for preserving relationships of interest among cells and providing informative visualizations. In low-dimensional embeddings, t-SNE exhibits well-defined phenotypic clustering. Additionally, Diffusion Maps can represent cell differentiation pathways with long projections along each diffusion component. We thus recommend a complementary approach using t-SNE and Diffusion Maps in order to extract diverse and informative cell relationship information in a two-dimensional setting from CyTOF data.
Conference Paper
In 2018, Lisbon is taking a relevant step by launching the very first platform for managing a smart(er) city in Portugal. The platform will support the Lisbon Integrated Operation Center (IOC), integrating the main internal information systems of Lisbon Municipality (CML) together with external systems from strategic partners. The platform will increase the value of existing systems, building the foundation for new modern city services, in a logic of integration and value generation. However, it also presents many complex technical challenges. Short time frames for the public tender setup and the platform implementation schedule, together with a multitude of systems to integrate, mean that these challenges require a new methodology. This paper presents this methodology, which is capable of managing complex and technically challenging market proposals by classifying them according to clearer, technical and objective criteria. The methodology adopted by CML in partnership with Instituto Superior de Engenharia de Lisboa (ISEL) for the definition of the IOC platform ensures broad adoption of standard and open protocols, capacity for growth, and low cost/effort integration with the various external partners, effectively leveraging the relationship both ways and reducing the classical vendor lock-in dependencies. The integration of multiple information sources will provide a holistic view of the city, ensuring a more efficient response to a crisis, providing a daily work tool to the different departments within CML, and aligning Lisbon with the smart city movement.
Conference Paper
Artificial Intelligence (AI) techniques, such as machine learning (ML), have been making significant progress over the past decade. Many systems have been applied in sensitive tasks involving critical infrastructures which affect human well-being or health. Before deploying an AI system, it is necessary to validate its behavior and guarantee that it will continue to perform as expected when deployed in a real-world environment. For this reason, it is important to comprehend specific aspects of such systems. For example, understanding how neural networks produce final predictions remains a fundamental challenge. Existing work on interpreting neural network predictions for images via feature visualization often focuses on explaining predictions for neurons of one single convolutional layer. Not presenting a global perspective over the features learned by the model leads the user to miss the bigger picture. In this work we focus on providing a representation based on the structure of deep neural networks. It presents a visualization able to give the user a global perspective over the feature maps of a convolutional neural network (CNN) in a single image, revealing potential problems of the learning representations present in the network feature maps.
Article
The spread of connected devices in modern vehicles exposes the lack of security in in-vehicle communication networks such as the controller area network (CAN) bus. The CAN bus protocol does not provide security mechanisms to counter cyber and physical attacks. Thus, an intrusion-detection system to identify attacks and anomalies on the CAN bus is desirable. In the present work, we propose a distance-based intrusion-detection network aimed at identifying attack messages injected on a CAN bus using a Kohonen self-organizing map (SOM) network. It is a powerful classifier that can be trained in both supervised and unsupervised modes. SOM has found broad application in security problems, but has never been applied to in-vehicle communication networks. We tested two approaches, first using a supervised X–Y fused Kohonen network (XYF) and then combining the XYF network with a K-means clustering algorithm (XYF–K) in order to improve the efficiency of the network. The models were tested on an open source dataset of data messages sent on a CAN bus 2.0B, containing a large traffic volume with a low number of features and more than 2000 different attack types, sent totally at random. Despite the complex structure of the CAN bus dataset, the proposed architectures showed high accuracy in the detection of attack messages.
Article
This paper presents a system that employs information visualization techniques to analyze urban traffic data and the impact of traffic emissions on urban air quality. Effective visualizations allow citizens and public authorities to identify trends, detect congested road sections at specific times, and perform monitoring and maintenance of traffic sensors. Since road transport is a major source of air pollution, the impact of traffic on air quality has also emerged as a new issue that traffic visualizations should address. Trafair Traffic Dashboard exploits traffic sensor data and traffic flow simulations to create an interactive layout focused on investigating the evolution of traffic in the urban area over time and space. The dashboard is the last step of a complex data framework that starts from the ingestion of traffic sensor observations and continues through anomaly detection, traffic modeling, and air quality impact analysis. We present the results of applying our proposed framework to two cities (Modena, in Italy, and Santiago de Compostela, in Spain), demonstrating the potential of the dashboard in identifying trends, seasonal events, and abnormal behaviors, and in understanding how the urban vehicle fleet affects air quality. We believe that the framework provides a powerful environment that may guide public decision-makers through effective analysis of traffic trends aimed at reducing traffic issues and mitigating the polluting effect of transportation.
Chapter
Following the analysis given by Alan Turing in 1951, one must expect that AI capabilities will eventually exceed those of humans across a wide range of real-world decision-making scenarios. Should this be a cause for concern, as Turing, Hawking, and others have suggested? And, if so, what can we do about it? While some in the mainstream AI community dismiss the issue, I will argue that the problem is real: we have to work out how to design AI systems that are far more powerful than ourselves while ensuring that they never have power over us. I believe the technical aspects of this problem are solvable. Whereas the standard model of AI proposes to build machines that optimize known, exogenously specified objectives, a preferable approach would be to build machines that are of provable benefit to humans. I introduce assistance games as a formal class of problems whose solution, under certain assumptions, has the desired property.
Chapter
This chapter surveys and analyses visual methods of explainability of Machine Learning (ML) approaches with a focus on moving from quasi-explanations that dominate in ML to actual domain-specific explanations supported by granular visuals. The importance of visual and granular methods to increase the interpretability and validity of the ML model has grown in recent years. Visuals have an appeal to human perception, which other methods do not. ML interpretation is fundamentally a human activity, not a machine activity. Thus, visual methods are more readily interpretable. Visual granularity is a natural way for efficient ML explanation. Understanding complex causal reasoning can be beyond human abilities without "downgrading" it to human perceptual and cognitive limits. The visual exploration of multidimensional data at different levels of granularity for knowledge discovery is a long-standing research focus. While multiple efficient methods for visual representation of high-dimensional data exist, the loss of interpretable information, occlusion, and clutter continue to be a challenge, which leads to quasi-explanations. This chapter starts with the motivation and the definitions of different forms of explainability and how these concepts and information granularity can be integrated in ML. The chapter focuses on a clear distinction between quasi-explanations and actual domain-specific explanations, as well as between a potentially explainable and an actually explained ML model, distinctions that are critically important for the further progress of the ML explainability domain. We discuss foundations of interpretability, overview visual interpretability, and present several types of methods to visualize ML models. Next, we present methods of visual discovery of ML models, with a focus on interpretable models, based on the recently introduced concept of General Line Coordinates (GLC). This family of methods takes the critical step of creating visual explanations that are not merely quasi-explanations but are also domain-specific visual explanations, while the methods themselves are domain-agnostic. The chapter includes results on theoretical limits to preserving n-D distances in lower dimensions, based on the Johnson-Lindenstrauss lemma, point-to-point and point-to-graph GLC approaches, and real-world case studies. The chapter also covers traditional visual methods for understanding multiple ML models, which include deep learning and time series models. We illustrate that many of these methods are quasi-explanations and need further enhancement to become actual domain-specific explanations. The chapter concludes by outlining open problems and current research frontiers.
Conference Paper
Air pollution is the second biggest environmental concern for Europeans after climate change and the major risk to public health. It is imperative to monitor the spatiotemporal patterns of urban air pollution. The TRAFAIR air quality dashboard is a web application that enables decision-makers to stay aware of urban air quality conditions, define new policies, and keep monitoring their effects. The architecture copes with the multidimensionality of data and the real-time visualization challenge of big data streams coming from a network of low-cost sensors. Moreover, it handles the visualization and management of the series of predictive air quality maps produced by an air pollution dispersion model. Air quality data are not only visualized at a limited set of locations at different times but in the continuous space-time domain, thanks to interpolated maps that estimate the pollution at un-sampled locations. Index Terms: air quality, temporal data, spatial data, visualization, interpolation maps, prediction
Chapter
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. Models such as ViLBERT, LXMERT and UNITER have significantly lifted the state of the art across a wide range of V+L benchmarks. However, little is known about the inner mechanisms that underlie their impressive success. To reveal the secrets behind the scenes, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection) generalizable to standard pre-trained V+L models, to decipher the inner workings of multimodal pre-training (e.g., implicit knowledge garnered in individual attention heads, inherent cross-modal alignment learned through contextualized multimodal embeddings). Through extensive analysis of each archetypal model architecture via these probing tasks, our key observations are: (i) Pre-trained models exhibit a propensity for attending over text rather than images during inference. (ii) There exists a subset of attention heads that are tailored for capturing cross-modal interactions. (iii) The learned attention matrix in pre-trained models demonstrates patterns coherent with the latent alignment between image regions and textual words. (iv) Plotted attention patterns reveal visually interpretable relations among image regions. (v) Pure linguistic knowledge is also effectively encoded in the attention heads. These are valuable insights serving to guide future work towards designing better model architectures and objectives for multimodal pre-training. (Code is available at https://github.com/JizeCao/VALUE).
Conference Paper
This work proposes a workflow for the publication of Open Spatial Data. The main contribution of this work is the automatic generation of metadata extracted from Open Geospatial Consortium (OGC) spatial services providing access to feature types and coverages. Besides, this work adopts a geospatial extension of the Data Catalog Vocabulary metadata application profile for data portals in Europe for the description of datasets. This extension, called GeoDCAT-AP, has been adopted because it allows for an appropriate crosswalk between the annotation requirements in the spatial domain and the metadata models accepted in general Open Data portals. The feasibility of the proposed workflow has been tested within the framework of the TRAFAIR project to publish monitoring and forecasting air quality data. Full-text available at: https://www.thinkmind.org/index.php?view=article&articleid=geoprocessing_2020_1_130_30086
Article
Advances in language modeling have led to the development of deep attention-based models that are performant across a wide variety of natural language processing (NLP) problems. These language models are typified by a pre-training process on large unlabeled text corpora and subsequently fine-tuned for specific tasks. Although considerable work has been devoted to understanding the attention mechanisms of pre-trained models, it is less understood how a model's attention mechanisms change when trained for a target NLP task. In this paper, we propose a visual analytics approach to understanding fine-tuning in attention-based language models. Our visualization, Attention Flows, is designed to support users in querying, tracing, and comparing attention within layers, across layers, and amongst attention heads in Transformer-based language models. To help users gain insight on how a classification decision is made, our design is centered on depicting classification-based attention at the deepest layer and how attention from prior layers flows throughout words in the input. Attention Flows supports the analysis of a single model, as well as the visual comparison between pre-trained and fine-tuned models via their similarities and differences. We use Attention Flows to study attention mechanisms in various sentence understanding tasks and highlight how attention evolves to address the nuances of solving these tasks.
Article
The coronavirus disease 2019 (COVID-19) pandemic has caused an unprecedented global health crisis, with several countries imposing lockdowns to control the coronavirus spread. Important research efforts are focused on evaluating the association of environmental factors with the survival and spread of the virus and different works have been published, with contradictory results in some cases. Data with spatial and temporal information is a key factor to get reliable results and, although there are some data repositories for monitoring the disease both globally and locally, an application that integrates and aggregates data from meteorological and air quality variables with COVID-19 information has not been described so far to the best of our knowledge. Here, we present DatAC (Data Against COVID-19), a data fusion project with an interactive web frontend that integrates COVID-19 and environmental data in Spain. DatAC is provided with powerful data analysis and statistical capabilities that allow users to explore and analyze individual trends and associations among the provided data. Using the application, we have evaluated the impact of the Spanish lockdown on the air quality, observing that NO2, CO, PM2.5, PM10 and SO2 levels decreased drastically in the entire territory, while O3 levels increased. We observed similar trends in urban and rural areas, although the impact has been more important in the former. Moreover, the application allowed us to analyze correlations among climate factors, such as ambient temperature, and the incidence of COVID-19 in Spain. Our results indicate that temperature is not the driving factor and without effective control actions, outbreaks will appear and warm weather will not substantially limit the growth of the pandemic. DatAC is available at https://covid19.genyo.es.