
Missing Data in Interactive High-Dimensional Data Visualization


In this paper we documented potential uses of linked brushing (as in figure 2) to explore missing value patterns and associations between missing values and the variables of interest. Other interactive data visualization methods can be equally useful and apply immediately in this approach. We hope to have provided a useful adjunct to traditional methodology for missing value problems. We expect it to be quite useful for data exploration as well as for imputation diagnostics. The XGobi software is freely available from StatLib at this URL: http://lib.stat.cmu.edu/general/XGobi/ For more references on XGobi, visit the following (identical) web pages: http://www.public.iastate.edu/~dicook/xgobi.html http://www.research.att.com/~andreas/xgobi.html
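The idea of brushing a missingness pattern and seeing it highlighted in a linked view can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the variable names `temp` and `humidity`, and the helper names `missing_indicator` and `brush` are hypothetical and are not part of XGobi.

```python
# Toy data: None marks a missing value (hypothetical records, not from the paper).
rows = [
    {"temp": 21.0, "humidity": 40.0},
    {"temp": None, "humidity": 85.0},
    {"temp": 19.5, "humidity": None},
    {"temp": None, "humidity": 90.0},
    {"temp": 23.0, "humidity": 35.0},
]

def missing_indicator(rows, variables):
    """Return a 0/1 matrix with the same shape as the data: 1 = value missing."""
    return [[1 if r[v] is None else 0 for v in variables] for r in rows]

def brush(indicator, column):
    """Simulate brushing in the missingness view: select rows flagged in one column."""
    return {i for i, flags in enumerate(indicator) if flags[column] == 1}

variables = ["temp", "humidity"]
indicator = missing_indicator(rows, variables)

# Brush every case whose temperature is missing.
brushed = brush(indicator, variables.index("temp"))

# Linked view: the same row ids are highlighted in a display of another variable,
# so the analyst can compare humidity for brushed and unbrushed cases.
highlighted = [rows[i]["humidity"] for i in sorted(brushed)]
others = [
    r["humidity"]
    for i, r in enumerate(rows)
    if i not in brushed and r["humidity"] is not None
]
```

In this toy data the brushed cases have noticeably higher humidity than the rest, which is the kind of association between missingness and a variable of interest that linked brushing is meant to reveal.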
Article
Massive simulations and arrays of sensing devices, in combination with increasing computing resources, have generated large, complex, high-dimensional datasets used to study phenomena across numerous fields of study. Visualization plays an important role in exploring such datasets. We provide a comprehensive survey of advances in high-dimensional data visualization that focuses on the past decade. We aim at providing guidance for data practitioners to navigate through a modular view of the recent advances, inspiring the creation of new visualizations along the enriched visualization pipeline, and identifying future opportunities for visualization research.
Conference Paper
Dealing with the curse of dimensionality is a key challenge in high-dimensional data visualization. We present SeekAView to address three main gaps in the existing research literature. First, automated methods like dimensionality reduction or clustering suffer from a lack of transparency in letting analysts interact with their outputs in real-time to suit their exploration strategies. The results often suffer from a lack of interpretability, especially for domain experts not trained in statistics and machine learning. Second, exploratory visualization techniques like scatter plots or parallel coordinates suffer from a lack of visual scalability: it is difficult to present a coherent overview of interesting combinations of dimensions. Third, the existing techniques do not provide a flexible workflow that allows for multiple perspectives into the analysis process by automatically detecting and suggesting potentially interesting subspaces. In SeekAView we address these issues using suggestion based visual exploration of interesting patterns for building and refining multidimensional subspaces. Compared to the state-of-the-art in subspace search and visualization methods, we achieve higher transparency in showing not only the results of the algorithms, but also interesting dimensions calibrated against different metrics. We integrate a visually scalable design space with an iterative workflow guiding the analysts by choosing the starting points and letting them slice and dice through the data to find interesting subspaces and detect correlations, clusters, and outliers. We present two usage scenarios for demonstrating how SeekAView can be applied in real-world data analysis scenarios.
Article
Visualization provides a powerful means for data analysis. But to be practical, visual analytics tools must support smooth and flexible use of visualizations at a fast rate. This becomes increasingly onerous with the ever-increasing size of real-world datasets. First, large databases make interaction more difficult once query response time exceeds several seconds. Second, any attempt to show all data points will overload the visualization, resulting in chaos that will only confuse the user. Over the last few years substantial effort has been put into addressing both of these issues and many innovative solutions have been proposed. Indeed, data visualization is a topic that is too large to be addressed in a single survey paper. Thus, we restrict our attention here to interactive visualization of large data sets. Our focus is therefore naturally skewed toward the query processing problem, which is handled by an underlying database system, rather than toward the data visualization problem itself.
Article
Demonstrate the application of decision trees, namely classification and regression trees (CARTs) and their cousins, boosted regression trees (BRTs), to understanding structure in missing data. Data were taken from employees at 3 different industrial sites in Australia; 7915 observations were included. The approach was evaluated using an occupational health data set comprising results of questionnaires, medical tests and environmental monitoring. Statistical methods included standard statistical tests and the 'rpart' and 'gbm' packages for CART and BRT analyses, respectively, from the statistical software 'R'. A simulation study was conducted to explore the capability of decision tree models in describing data with missingness artificially introduced. CART and BRT models were effective in highlighting a missingness structure in the data, related to the type of data (medical or environmental), the site in which it was collected, the number of visits, and the presence of extreme values. The simulation study revealed that CART models were able to identify variables and values responsible for inducing missingness. There was greater variation in variable importance for unstructured as compared to structured missingness. Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data, and for selecting variables important for predicting missingness. BRT models can show how values of other variables influence missingness, which may prove useful for researchers. Researchers are encouraged to use CART and BRT models to explore and understand missing data.
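To make the tree-based idea concrete, the sketch below searches for the single split that best separates rows where a variable is missing from rows where it is observed. It is a from-scratch, one-level illustration using Gini impurity, not the 'rpart' or 'gbm' procedures the abstract refers to; the records and the names `site`, `visits`, and `lung` are hypothetical.

```python
def gini(flags):
    """Gini impurity of a list of 0/1 missingness flags."""
    if not flags:
        return 0.0
    p = sum(flags) / len(flags)
    return 2.0 * p * (1.0 - p)

def best_split(records, predictors, target):
    """Find the (variable, threshold) split that best separates rows where
    `target` is missing from rows where it is observed.

    Returns (weighted impurity, variable, threshold)."""
    flags = [1 if r[target] is None else 0 for r in records]
    best = None
    for var in predictors:
        values = sorted({r[var] for r in records})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [f for r, f in zip(records, flags) if r[var] <= threshold]
            right = [f for r, f in zip(records, flags) if r[var] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
            if best is None or score < best[0]:
                best = (score, var, threshold)
    return best

# Hypothetical records: the lung test is missing for workers with a single visit.
records = [
    {"site": 1, "visits": 1, "lung": None},
    {"site": 2, "visits": 1, "lung": None},
    {"site": 1, "visits": 3, "lung": 2.9},
    {"site": 3, "visits": 4, "lung": 3.4},
    {"site": 2, "visits": 2, "lung": 3.1},
    {"site": 3, "visits": 1, "lung": None},
]

score, variable, threshold = best_split(records, ["site", "visits"], "lung")
```

On this toy data the search selects `visits` with a threshold of 1.5, that is, it recovers the variable and value that induce the missingness. A full CART would apply the same search recursively to each resulting subset.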
Article
Data visualization is a powerful tool to communicate data in a clear, digestible format through graphical means. To be effective, however, form and function need to work in tandem, filtering layers of noise to reveal the key aspects of the analyzed data. Indeed, this could prove to be sufficient in discovering already known patterns. Still, the search for undiscovered patterns would require the full dataset to be presented as a whole, which bears the risk of sensory overload. Human sensory systems function as a systemic unit in relation to one another, dynamically sampling the signals around us to give a concise scene analysis. To decipher a complex, multidimensional dataset, a representational system that is able to reproduce the layers of information through different stimulations would be required. This article explores the possibilities of using multimodal data representation as a method to communicate multidimensional data, guided by the principles of Gestalt psychology. Point Cloud, an artwork that implements such explorations through the visualization and sonification of lightning data is presented as an application of this research. The Web extra can be found at http://youtu.be/pQtxsvgv80E.
Conference Paper
Interactive ad-hoc analytics over large datasets has become an increasingly popular use case. We detail the challenges encountered when building a distributed system that allows the interactive exploration of a data cube. We introduce DICE, a distributed system that uses a novel session-oriented model for data cube exploration, designed to provide the user with interactive sub-second latencies for specified accuracy levels. A novel framework is provided that combines three concepts: faceted exploration of data cubes, speculative execution of queries and query execution over subsets of data. We discuss design considerations, implementation details and optimizations of our system. Experiments demonstrate that DICE provides a sub-second interactive cube exploration experience at the billion-tuple scale that is at least 33% faster than current approaches.
Conference Paper
Bayesian Networks are complex systems models that present rich output that can be difficult to communicate to users. In this paper a novel information visualization tool is evaluated for performance on accuracy, efficiency and user comprehension criteria. The visualization is tested across a range of user tasks, including identifying important information, inferring relationships between factors and comparing model outputs. While the interpretation of model output is less accurate for the visualization tool in question, this is balanced by significant gains in efficiency and user comprehension. It is suggested that the visualization is appropriate in contexts such as operational management where users refer to the tool often for support in making uncertain decisions, and can best be defined as a casual visualization to complement existing decision making activities on a daily basis.
Conference Paper
Visualization of uncertainty in datasets is a new field of research, which aims to represent incomplete data for analysis in real scenarios. Datasets, especially multi-dimensional datasets, often contain either errors or uncertain values. To address this challenge, we may treat these uncertainties as scalar values like probability. For visual representation in parallel coordinates, we draw a small "circle" to temporarily define a dummy vertex for an uncertain value of a data item, at the crossing point between a polyline and the axis of a given dimension. Furthermore, these temporary positions of uncertainty can be permuted to achieve visual effectiveness. Optimizing the order of uncertain values in this way provides an opportunity to tackle another important challenge in information visualization: clutter reduction. Visual clutter obscures the visual structure even in small datasets. In this paper, we apply Sugiyama's layered directed graph drawing algorithm to parallel coordinates visualization to minimize the number of edge crossings among polylines, which significantly improves the readability of the visual structure. Experiments in case studies have shown the effectiveness of our new methods for clutter reduction in parallel coordinates visualization. These experiments also imply that besides visual clutter, the number of uncertain values and the type of multi-dimensional data are important attributes that affect visualization performance in this field.
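The objective being minimized, the number of crossings between polylines on two adjacent axes, can be illustrated with a short Python sketch. This is a brute-force illustration of the objective only and does not implement Sugiyama's algorithm; the function names, positions, and candidate slots are hypothetical.

```python
from itertools import combinations

def count_crossings(left, right):
    """Count pairs of polyline segments that cross between two adjacent axes.

    left[i] and right[i] are the positions of item i on the two axes.
    Two segments cross when their order on the left axis is reversed on the right.
    """
    return sum(
        1
        for i, j in combinations(range(len(left)), 2)
        if (left[i] - left[j]) * (right[i] - right[j]) < 0
    )

def best_position(left, right, uncertain, candidates):
    """Pick the right-axis position for one uncertain item that minimizes crossings."""
    def crossings_at(pos):
        trial = list(right)
        trial[uncertain] = pos
        return count_crossings(left, trial)
    return min(candidates, key=crossings_at)

# Item 1 has an uncertain value on the right axis; 0.0 is only a placeholder.
left = [1.0, 2.0, 3.0, 4.0]
right = [1.5, 0.0, 3.5, 4.5]

before = count_crossings(left, right)
chosen = best_position(left, right, uncertain=1, candidates=[0.5, 2.5, 5.0])
```

With the placeholder position the polylines of items 0 and 1 cross once; placing the dummy vertex at 2.5 removes that crossing, whereas 5.0 would introduce two. Because the value is uncertain, its drawn position is free to be chosen for readability in this way.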
Article
The considerable previous work characterizing visualization usage has focused on low-level tasks or interactions and high-level tasks, leaving a gap between them that is not addressed. This gap leads to a lack of distinction between the ends and means of a task, limiting the potential for rigorous analysis. We contribute a multi-level typology of visualization tasks to address this gap, distinguishing why and how a visualization task is performed, as well as what the task inputs and outputs are. Our typology allows complex tasks to be expressed as sequences of interdependent simpler tasks, resulting in concise and flexible descriptions for tasks of varying complexity and scope. It provides abstract rather than domain-specific descriptions of tasks, so that useful comparisons can be made between visualization systems targeted at different application domains. This descriptive power supports a level of analysis required for the generation of new designs, by guiding the translation of domain-specific problems into abstract tasks, and for the qualitative evaluation of visualization usage. We demonstrate the benefits of our approach in a detailed case study, comparing task descriptions from our typology to those derived from related work. We also discuss the similarities and differences between our typology and over two dozen extant classification systems and theoretical frameworks from the literatures of visualization, human-computer interaction, information retrieval, communications, and cartography.
Buja, A., Cook, D., and Swayne, D. F. (1996), "Interactive High-Dimensional Data Visualization," Journal of Computational and Graphical Statistics, 5, pp. 78-99.
Unwin, A. R., Hawkins, G., Hofmann, H., and Siegl, B. (1996), "Interactive Graphics for Data Sets with Missing Values - MANET," Journal of Computational and Graphical Statistics, 5, pp. 113-122.
Article
XGobi is a data visualization system with state-of-the-art interactive and dynamic methods for the manipulation of views of data. It implements 2-D displays of projections of points and lines in high-dimensional spaces, as well as parallel coordinate displays and textual views thereof. Projection tools include dotplots of single variables, plots of pairs of variables, 3-D data rotations, various grand tours, and interactive projection pursuit. Views of the data can be reshaped. Points can be labeled and brushed with glyphs and colors. Lines can be edited and colored. Several XGobi processes can be run simultaneously and linked for labeling, brushing, and sharing of projections. Missing data are accommodated and their patterns can be examined; multiple imputations can be given to XGobi for rapid visual diagnostics. XGobi includes an extensive online help facility. XGobi can be integrated in other software systems, as has been done for the data analysis language S, the geographic information system (GIS) ArcView™, and the interactive multidimensional scaling program XGvis. XGobi is implemented in the X Window System™ for portability as well as the ability to run across a network.
Chapter
Data analysts perform a wide variety of computational tasks in the course of an analysis, and they should be able to do them all on the same platform. Xgobi helps to make this possible by bringing state-of-the-art dynamic graphic methods for the display and manipulation of scatter plots to the UNIX® workstation, and linking them to the S data analysis environment. Xgobi is implemented in the X Window System™, which offers portability across a wide variety of workstations, X terminals, and personal computers, as well as the ability to run smoothly across a network. A user can run xgobi and S simultaneously, making it possible to use each program's functions on the same data. Using a highly interactive direct-manipulation design, xgobi offers an array of familiar scatterplot tools. Users can view pairwise plots, three-dimensional rotations and grand tour sequences. Views of the data can be reshaped, and points can be identified or brushed. Projection coefficients and brushing characteristics can be saved in a format usable by S, or in ASCII files. A user can run two or more independent xgobi processes and pass data between them. Xgobi also includes an extensive on-line help facility.
Article
Missing values are a problem for statistical methods. This applies just as much to modern methods such as interactive graphics as to more classical methods. The MANET software has been developed for keeping track of missing values in interactive graphics analyses and for investigating new interactive graphics tools.
Article
In the past few years there has been a surge in the development of new statistical theories and methods that take advantage of the high speed digital computer. The payoff for such intensive computation methods is freedom from two limiting factors that have dominated statistical theory since its beginning: the assumption that the data conform to a bell-shaped curve and the need to focus on statistical measures whose theoretical properties can be analyzed mathematically. The new methods free the statistician to attack more complicated problems, exploiting a wider array of statistical tools. The bootstrap method is examined and evaluated as an example of this new generation of statistical tools.
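As a minimal sketch of the bootstrap idea, the Python below estimates the standard error of a statistic by resampling the observed data with replacement and measuring how much the statistic varies across resamples. The sample values, the default number of resamples, and the function name `bootstrap_se` are illustrative assumptions, not taken from the article.

```python
import random
import statistics

def bootstrap_se(sample, statistic, n_resamples=2000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    replicates = [
        statistic([rng.choice(sample) for _ in range(n)])
        for _ in range(n_resamples)
    ]
    # The spread of the replicates approximates the sampling variability.
    return statistics.stdev(replicates)

# Hypothetical skewed sample; the median has no simple textbook standard error.
data = [1.2, 1.9, 2.1, 2.4, 2.8, 3.0, 3.3, 4.1, 6.7, 12.5]
se_median = bootstrap_se(data, statistics.median)
```

No bell-shaped assumption enters the calculation, and the statistic is passed in as an ordinary function, so the same code works for measures whose theoretical properties are hard to derive analytically.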
Article
We propose a rudimentary taxonomy of interactive data visualization based on a triad of data analytic tasks: finding Gestalt, posing queries, and making comparisons. These tasks are supported by three classes of interactive view manipulation: focusing, linking and arranging views. This discussion extends earlier work on the principles of focusing and linking and sets them on a firmer base. Next, we give a high-level introduction to a particular system for multivariate data visualization: XGobi. This introduction is not comprehensive but emphasizes XGobi tools that are examples of focusing, linking and arranging views, namely: high-dimensional projections, linked scatterplot brushing, and matrices of conditional plots.