Preprint

Scatterplot Selection Applying a Graph Coloring Problem

Authors:

Abstract

Scatterplot selection is an effective approach to representing the essential portions of multidimensional data in a limited display space. Various metrics for evaluating scatterplots, such as scagnostics, have been presented and applied to scatterplot selection. This paper presents a new scatterplot selection technique that applies multiple metrics. The technique first calculates scores of scatterplots with multiple metrics and then constructs a graph by connecting similar scatterplots. It then solves a graph coloring problem on this graph so that similar scatterplots are assigned different colors. A diverse set of scatterplots can be extracted by selecting those that are assigned the same color. This paper introduces visualization examples with a retail dataset containing multidimensional climate and sales values.
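
The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not state the similarity measure, the threshold, or the coloring algorithm, so everything below is an assumption made for illustration: Euclidean distance between metric-score vectors, a hypothetical `threshold` parameter, and a simple greedy coloring in descending degree order. The function names `greedy_coloring` and `select_by_color` are likewise hypothetical.

```python
import math
from itertools import combinations


def greedy_coloring(scores, threshold):
    """Color scatterplots so that similar ones receive different colors.

    scores: dict mapping a scatterplot name to a tuple of metric scores.
    threshold: assumed cutoff; two plots are "similar" (connected by an
               edge) when the Euclidean distance of their scores is below it.
    Returns a dict mapping each name to a color index (0, 1, 2, ...).
    """
    names = list(scores)
    adjacency = {name: set() for name in names}
    for a, b in combinations(names, 2):
        if math.dist(scores[a], scores[b]) < threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    colors = {}
    # Greedy heuristic (an assumption, not the paper's stated algorithm):
    # visit nodes in descending degree order and give each the smallest
    # color index not already used by a colored neighbor.
    for name in sorted(names, key=lambda n: -len(adjacency[n])):
        used = {colors[m] for m in adjacency[name] if m in colors}
        color = 0
        while color in used:
            color += 1
        colors[name] = color
    return colors


def select_by_color(colors, color):
    """Return the names of all scatterplots assigned the given color."""
    return sorted(name for name, c in colors.items() if c == color)


# Toy scores under two hypothetical metrics; A/B and C/D are near-duplicates.
scores = {
    "A": (0.90, 0.10),
    "B": (0.88, 0.12),
    "C": (0.10, 0.90),
    "D": (0.12, 0.88),
}
colors = greedy_coloring(scores, threshold=0.2)
selected = select_by_color(colors, 0)
```

Because adjacent (similar) plots never share a color, any single color class contains mutually dissimilar plots. In the toy data, one plot from each near-duplicate pair ends up in the selected set.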


Article
In this paper, we examine the robustness of scagnostics through a series of theoretical and empirical studies. First, we investigate the sensitivity of scagnostics by employing perturbing operations on more than 60M synthetic and real-world scatterplots. We found that two scagnostic measures, Outlying and Clumpy, are overly sensitive to data binning. To understand how these measures align with human judgments of visual features, we conducted a study with 24 participants, which reveals that i) humans are not sensitive to small perturbations of the data that cause large changes in both measures, and ii) the perception of clumpiness heavily depends on per-cluster topologies and structures. Motivated by these results, we propose Robust Scagnostics (RScag) by combining adaptive binning with a hierarchy-based form of scagnostics. An analysis shows that RScag improves on the robustness of original scagnostics, aligns better with human judgments, and is as fast as the traditional scagnostic measures.
Article
Data analysis often involves finding models that can explain patterns in data, and reduce possibly large data sets to more compact model-based representations. In Statistics, many methods are available to compute model information. Among others, regression models are widely used to explain data. However, regression analysis typically searches for the best model based on the global distribution of data. On the other hand, a data set may be partitioned into subsets, each requiring individual models. While automatic data subsetting methods exist, these often require parameters or domain knowledge to work with. We propose a system for visual-interactive regression analysis for scatter plot data, supporting both global and local regression modeling. We introduce a novel regression lens concept, allowing a user to interactively select a portion of data, on which regression analysis is run in interactive time. The lens gives encompassing visual feedback on the quality of candidate models as it is interactively navigated across the input data. While our regression lens can be used for fully interactive modeling, we also provide user guidance suggesting appropriate models and data subsets, by means of regression quality scores. We show, by means of use cases, that our regression lens is an effective tool for user-driven regression modeling and supports model understanding.
Conference Paper
Our goal is to accurately model human class separation judgements in color-coded scatterplots. Towards this goal, we propose a set of 2002 visual separation measures, by systematically combining 17 neighborhood graphs and 14 class purity functions, with different parameterizations. Using a Machine Learning framework, we evaluate these measures based on how well they predict human separation judgements. We found that more than 58% of the 2002 new measures outperform the best state-of-the-art Distance Consistency (DSC) measure. Among the 2002, the best measure is the average proportion of same-class neighbors among the 0.35-Observable Neighbors of each point of the target class (short GONG 0.35 DIR CPT), with a prediction accuracy of 92.9%, which is 11.7% better than DSC. We also discuss alternative, well-performing measures and give guidelines when to use which.
Article
For high-dimensional data, this work proposes two novel visual exploration methods to gain insights into the data aspect and the dimension aspect of the data. The first is a Dimension Projection Matrix, as an extension of a scatterplot matrix. In the matrix, each row or column represents a group of dimensions, and each cell shows a dimension projection (such as MDS) of the data with the corresponding dimensions. The second is a Dimension Projection Tree, where every node is either a dimension projection plot or a Dimension Projection Matrix. Nodes are connected with links and each child node in the tree covers a subset of the parent node's dimensions or a subset of the parent node's data items. While the tree nodes visualize the subspaces of dimensions or subsets of the data items under exploration, the matrix nodes enable cross-comparison between different combinations of subspaces. Both the Dimension Projection Matrix and the Dimension Projection Tree can be constructed algorithmically through automation, or manually through user interaction. Our implementation enables interactions such as drilling down to explore different levels of the data, merging or splitting the subspaces to adjust the matrix, and applying brushing to select data clusters. Our method enables simultaneously exploring data correlation and dimension correlation for data with high dimensions.
Conference Paper
We introduce Tukey and Tukey scagnostics and develop graph-theoretic methods for implementing their procedure on large datasets.
Article
Scatterplot matrices (SPLOMs) are widely used for exploring multidimensional data. Scatterplot diagnostics (scagnostics) approaches measure characteristics of scatterplots to automatically find potentially interesting plots, thereby making SPLOMs more scalable with the dimension count. While statistical measures such as regression lines can capture orientation, and graph-theoretic scagnostics measures can capture shape, there is no scatterplot characterization measure that uses both descriptors. Based on well-known results in shape analysis, we propose a scagnostics approach that captures both scatterplot shape and orientation using skeletons (or medial axes). Our representation can handle complex spatial distributions, helps discovery of principal trends in a multiscale way, scales visually well with the number of samples, is robust to noise, and is automatic and fast to compute. We define skeleton-based similarity metrics for the visual exploration and analysis of SPLOMs. We perform a user study to measure the human perception of scatterplot similarity and compare the outcome to our results as well as to graph-based scagnostics and other visual quality metrics. Our skeleton-based metrics outperform previously defined measures both in terms of closeness to perceptually-based similarity and computation time efficiency.
Article
Parallel coordinate plots (PCPs) are among the most useful techniques for the visualization and exploration of high-dimensional data spaces. They are especially useful for the representation of correlations among the dimensions, which identify relationships and interdependencies between variables. However, within these high-dimensional spaces, PCPs face difficulties in displaying the correlation between combinations of dimensions and generally require additional display space as the number of dimensions increases. In this paper, we present a new technique for high-dimensional data visualization in which a set of low-dimensional PCPs are interactively constructed by sampling user-selected subsets of the high-dimensional data space. In our technique, we first construct a graph visualization of sets of well-correlated dimensions. Users observe this graph and are able to interactively select the dimensions by sampling from its cliques, thereby dynamically specifying the most relevant lower dimensional data to be used for the construction of focused PCPs. Our interactive sampling overcomes the shortcomings of the PCPs by enabling the visualization of the most meaningful dimensions (i.e., the most relevant information) from high-dimensional spaces. We demonstrate the effectiveness of our technique through two case studies, where we show that the proposed interactive low-dimensional space constructions were pivotal for visualizing the high-dimensional data and discovering new patterns.
Article
Despite years of research yielding systems and guidelines to aid visualization design, practitioners still face the challenge of identifying the best visualization for a given dataset and task. One promising approach to circumvent this problem is to leverage perceptual laws to quantitatively evaluate the effectiveness of a visualization design. Following previously established methodologies, we conduct a large scale (n = 1687) crowdsourced experiment to investigate whether the perception of correlation in nine commonly used visualizations can be modeled using Weber's law. The results of this experiment contribute to our understanding of information visualization by establishing that: (1) for all tested visualizations, the precision of correlation judgment could be modeled by Weber's law, (2) correlation judgment precision showed striking variation between negatively and positively correlated data, and (3) Weber models provide a concise means to quantify, compare, and rank the perceptual precision afforded by a visualization.
Article
Multi-dimensional data visualization is an important research topic that has been receiving increasing attention. Several techniques that apply scatterplot matrices have been proposed to represent multi-dimensional data as a collection of two-dimensional data visualization spaces. Typically, the scatterplot-based approach makes it easier to understand relations between particular pairs of dimensions, but it often requires too much display space to show all possible scatterplots. This paper presents a technique to display meaningful sets of scatterplots generated from high-dimensional datasets. Our technique first evaluates all possible scatterplots generated from high-dimensional datasets, and selects meaningful sets. It then calculates the similarity between arbitrary pairs of the selected scatterplots, and places relevant scatterplots closer together in the display space while ensuring that they never overlap each other. This design policy makes it easier for users to visually compare relevant sets of scatterplots. This paper presents algorithms to place the scatterplots by combining ideal position calculation with rectangle packing, and two examples demonstrating the effectiveness of the presented technique.
Article
Correlation analysis can reveal the complex relationships that often exist among the variables in multivariate data. However, as the number of variables grows, it can be difficult to gain a good understanding of the correlation landscape and important intricate relationships might be missed. We previously introduced a technique that arranged the variables into a 2D layout, encoding their pairwise correlations. We then used this layout as a network for the interactive ordering of axes in parallel coordinate displays. Our current work expresses the layout as a correlation map and employs it for visual correlation analysis. In contrast to matrix displays where correlations are indicated at intersections of rows and columns, our map conveys correlations by spatial proximity which is more direct and more focused on the variables in play. We make the following new contributions, some unique to our map: (1) we devise mechanisms that handle both categorical and numerical variables within a unified framework, (2) we achieve scalability for large numbers of variables via a multi-scale semantic zooming approach, (3) we provide interactive techniques for exploring the impact of value bracketing on correlations, and (4) we visualize data relations within the sub-spaces spanned by correlated variables by projecting the data into a corresponding tessellation of the map.
Article
Dimension reduction techniques are essential for feature selection and feature extraction of complex high-dimensional data. These techniques, which construct low-dimensional representations of data, are typically geometrically motivated, computationally efficient and approximately preserve certain structural properties of the data. However, they are often used as black box solutions in data exploration and their results can be difficult to interpret. To assess the quality of these results, quality measures, such as co-ranking [LV09], have been proposed to quantify structural distortions that occur between high-dimensional and low-dimensional data representations. Such measures could be evaluated and visualized point-wise to further highlight erroneous regions [MLGH13]. In this work, we provide an interactive visualization framework for exploring high-dimensional data via its two-dimensional embeddings obtained from dimension reduction, using a rich set of user interactions. We ask the following question: what new insights do we obtain regarding the structure of the data, with interactive manipulations of its embeddings in the visual space? We augment the two-dimensional embeddings with structural abstractions obtained from hierarchical clusterings, to help users navigate and manipulate subsets of the data. We use point-wise distortion measures to highlight interesting regions in the domain, and further to guide our selection of the appropriate level of clusterings that are aligned with the regions of interest. Under the static setting, point-wise distortions indicate the level of structural uncertainty within the embeddings. Under the dynamic setting, on-the-fly updates of point-wise distortions due to data movement and data deletion reflect structural relations among different parts of the data, which may lead to new and valuable insights.
Article
We provide two contributions, a taxonomy of visual cluster separation factors in scatterplots, and an in-depth qualitative evaluation of two recently proposed and validated separation measures. We initially intended to use these measures to provide guidance for the use of dimension reduction (DR) techniques and visual encoding (VE) choices, but found that they failed to produce reliable results. To understand why, we conducted a systematic qualitative data study covering a broad collection of 75 real and synthetic high-dimensional datasets, four DR techniques, and three scatterplot-based visual encodings. Two authors visually inspected over 800 plots to determine whether or not the measures created plausible results. We found that they failed in over half the cases overall, and in over two-thirds of the cases involving real datasets. Using open and axial coding of failure reasons and separability characteristics, we generated a taxonomy of visual cluster separability factors. We iteratively refined its explanatory clarity and power by mapping the studied datasets and success and failure ranges of the measures onto the factor axes. Our taxonomy has four categories, ordered by their ability to influence successors: Scale, Point Distance, Shape, and Position. Each category is split into Within-Cluster factors such as density, curvature, isotropy, and clumpiness, and Between-Cluster factors that arise from the variance of these properties, culminating in the overarching factor of class separation. The resulting taxonomy can be used to guide the design and the evaluation of cluster separation measures. © 2012 Wiley Periodicals, Inc.
Conference Paper
A scatter plot displays a relation between a pair of variables. Given a set of v variables, there are v(v-1)/2 pairs of variables, and thus the same number of possible pairwise scatter plots. Therefore, even for small sets of variables, the number of scatter plots can be large. Scatter plot matrices (SPLOMs) can easily run out of pixels when presenting high-dimensional data. We introduce a theoretical method and a testbed for assessing whether our method can be used to guide interactive exploration of high-dimensional data. The method is based on nine characterizations of the 2D distributions of orthogonal pairwise projections on a set of points in multidimensional Euclidean space. Working directly with these characterizations, we can locate anomalies for further analysis or search for similar distributions in a large SPLOM with more than a hundred dimensions. Our testbed, ScagExplorer, is developed in order to evaluate the feasibility of handling huge collections of scatter plots.
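
The pair count stated above can be checked with a short sketch. The helper name `scatterplot_pairs` is hypothetical and used only for illustration; it enumerates every unordered pair of variables, which is what a SPLOM (ignoring the diagonal and mirrored cells) would display.

```python
from itertools import combinations


def scatterplot_pairs(variables):
    """Enumerate all unordered variable pairs, one per possible scatter plot."""
    return list(combinations(variables, 2))


# With v variables the number of pairs is v * (v - 1) / 2.
small = scatterplot_pairs(["temperature", "humidity", "sales", "visitors"])
large_count = len(scatterplot_pairs(range(100)))
```

With 4 variables this gives 6 plots, and with 100 variables it gives 4950, which shows how quickly a full scatter plot matrix outgrows the available display space.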
Conference Paper
Multidimensional data visualization is an important research topic that has been receiving increasing attention. Several techniques that use parallel coordinate plots have been proposed to represent all dimensions of data in a single display space. In addition, several other techniques that apply scatter plot matrices have been proposed to represent multidimensional data as a collection of low-dimensional data visualization spaces. Typically, when using the latter approach it is easier to understand relations among particular dimensions, but it is often difficult to observe relations between dimensions separated into different visualization spaces. This paper presents a framework for displaying an arrangement of low-dimensional data visualization spaces that are generated from high-dimensional datasets. Our proposed technique first divides the dimensions of the input datasets into groups of lower dimensions based on their correlations or other relationships. If the groups of lower dimensions can be visualized in independent rectangular spaces, our technique packs the set of low-dimensional data visualizations into a single display space. Because our technique places relevant low-dimensions closer together in the display space, it is easier to visually compare relevant sets of low-dimensional data visualizations. In this paper, we describe in detail how we implement our framework using parallel coordinate plots, and present several results demonstrating its effectiveness.
Article
Multivariate data visualization is a classic topic, for which many solutions have been proposed, each with its own strengths and weaknesses. In standard solutions the structure of the visualization is fixed; we explore how to give the user more freedom to define visualizations. Our new approach is based on the usage of Flexible Linked Axes: The user is enabled to define a visualization by drawing and linking axes on a canvas. Each axis has an associated attribute and range, which can be adapted. Links between pairs of axes are used to show data in either scatter plot- or Parallel Coordinates Plot-style. Flexible Linked Axes enable users to define a wide variety of different visualizations. These include standard methods, such as scatter plot matrices, radar charts, and PCPs [11]; less well known approaches, such as Hyperboxes [1], TimeWheels [17], and many-to-many relational parallel coordinate displays [14]; and also custom visualizations, consisting of combinations of scatter plots and PCPs. Furthermore, our method allows users to define composite visualizations that automatically support brushing and linking. We have discussed our approach with ten prospective users, who found the concept easy to understand and highly promising.
Article
Many visualization techniques involve mapping high-dimensional data spaces to lower-dimensional views. Unfortunately, mapping a high-dimensional data space into a scatterplot involves a loss of information; or, even worse, it can give a misleading picture of valuable structure in higher dimensions. In this paper, we propose class consistency as a measure of the quality of the mapping. Class consistency enforces the constraint that classes of n–D data are shown clearly in 2–D scatterplots. We propose two quantitative measures of class consistency, one based on the distance to the class’s center of gravity, and another based on the entropies of the spatial distributions of classes. We performed an experiment where users choose good views, and show that class consistency has good precision and recall. We also evaluate both consistency measures over a range of data sets and show that these measures are efficient and robust.
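
To make the centroid-based idea above concrete, here is a minimal sketch of one way such a score could be computed. The abstract does not give the formula, so this is an assumption made for illustration: the score is the fraction of 2D points whose nearest class centroid is the centroid of their own class. The function name `distance_consistency` is hypothetical.

```python
import math
from collections import defaultdict


def distance_consistency(points, labels):
    """Fraction of points that lie closest to their own class centroid.

    points: list of coordinate tuples (e.g. 2D scatterplot positions).
    labels: list of class labels, one per point.
    This is an illustrative assumption, not the paper's exact definition.
    """
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    # Center of gravity of each class: per-coordinate mean of its points.
    centroids = {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in groups.items()
    }

    consistent = 0
    for point, label in zip(points, labels):
        nearest = min(centroids, key=lambda k: math.dist(point, centroids[k]))
        if nearest == label:
            consistent += 1
    return consistent / len(points)


# Two well-separated classes: every point is nearest to its own centroid.
clean = distance_consistency(
    [(0, 0), (1, 0), (10, 0), (11, 0)], ["a", "a", "b", "b"]
)
# One point of class "b" sits inside class "a", lowering the score.
mixed = distance_consistency(
    [(0, 0), (1, 0), (10, 0), (11, 0), (0.5, 0)], ["a", "a", "b", "b", "b"]
)
```

A score near 1 suggests the classes are shown clearly in the projection, while lower scores indicate that classes overlap in the 2D view.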
J. H. Lee, K. T. McDonnell, A. Zelenyuk, D. Imre, and K. Mueller. A structure-based distance metric for high-dimensional space exploration with multidimensional scaling. IEEE Transactions on Visualization and Computer Graphics, 20(3):351-364, 2013.
N. Morishita, M. Hagita, H. Shioya, and T. Itoh. Graph coloring algorithms for photo selection. Conference on Applied Mathematics (in Japanese), 106-109, 2016.
Z. Zhang, K. T. McDonnell, and K. Mueller. A network-based interface for the exploration of high-dimensional data spaces. IEEE Pacific Visualization Symposium 2012, 17-24, 2012.