Visualization Databases for the Analysis of Large Complex Datasets.

Journal of Machine Learning Research 5:193-200, 01/2009.
Source: DBLP


Comprehensive visualization that preserves the information in a large complex dataset requires a visualization database (VDB): many displays, some with many pages, and with one or more panels per page. A single display using a specific display method results from partitioning the data into subsets, sampling the subsets, and applying the method to each sample, typically one per panel. The time of the analyst to generate a display is not increased by choosing a large sample over a small one. Displays and display viewers can be designed to allow rapid scanning, and often, it is not necessary to view every page of a display. VDBs, already successful just with off-the-shelf tools, can be greatly improved by a rethinking of all areas of data visualization in the context of a database of many large displays.
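
The partition/sample/display pipeline described in the abstract can be sketched in a few lines of R. This is a minimal illustration, not the authors' implementation: the dataset, partitioning variable, sample cap, and output paths are all stand-ins.

    # Build one display of a VDB: partition the data into subsets, sample
    # each subset, and apply one display method per sample, one panel per
    # page, writing the pages to disk for later scanning.
    library(lattice)

    mtcars$cyl <- factor(mtcars$cyl)           # partitioning variable
    subsets <- split(mtcars, mtcars$cyl)       # partition into subsets
    samples <- lapply(subsets, function(d)     # sample each subset
      d[sample(nrow(d), min(nrow(d), 20)), ])

    dir.create("vdb/mpg-vs-wt", recursive = TRUE, showWarnings = FALSE)
    for (nm in names(samples)) {
      pdf(file.path("vdb/mpg-vs-wt", paste0("page-", nm, ".pdf")))
      print(xyplot(mpg ~ wt, data = samples[[nm]],
                   main = paste("cyl =", nm))) # display method per sample
      dev.off()
    }

Note that the analyst's effort is the same whether each sample holds 20 rows or 20 million; only the machine time to render the pages grows.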

Cited by:
    • "But the emphasis in this section is on " C without the A, " a very useful variant of CA that was essentially the strategy used in [7]. We will call it CWA, meaning that we divide into chunks and apply some estimator to each chunk, but then somehow use the resulting collection of estimators separately rather than, say, averaging them. "
    ABSTRACT: The growth in the use of computationally intensive statistical procedures, especially with Big Data, has necessitated the use of parallel computation on diverse platforms such as multicore, GPU, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to "embarrassingly parallel" (EP) algorithms, especially on non-shared memory platforms. This paper develops a broadly-applicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain memory limitations.
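
    As a hedged illustration of the chunking strategy this abstract describes, the sketch below divides simulated data into chunks and fits the same estimator on each chunk in parallel, an embarrassingly parallel computation; the estimator, chunk count, and data are illustrative, not taken from the paper.

        # Chunk the data, apply an estimator to each chunk in parallel (EP),
        # then either average the results (CA) or keep the collection (CWA).
        library(parallel)

        n      <- 100000
        dat    <- data.frame(x = rnorm(n))
        dat$y  <- 2 * dat$x + rnorm(n)
        chunks <- split(dat, rep(1:4, length.out = n))  # divide into chunks

        cl   <- makeCluster(4)
        fits <- parLapply(cl, chunks,                   # no interprocess chatter
                          function(ch) coef(lm(y ~ x, data = ch)))
        stopCluster(cl)

        ca  <- Reduce(`+`, fits) / length(fits)  # CA: average chunk estimators
        cwa <- fits                              # CWA: use estimators separately
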
    • "While these choices are largely up to the user to specify, a good Trellis implementation will provide mechanisms to specify and explore these choices with ease. More details on these and other considerations can be found in [11] and [2] "
    ABSTRACT: Trelliscope emanates from the Trellis Display framework for visualization and the Divide and Recombine (D&R) approach to analyzing large complex data. In Trellis, the data are broken up into subsets, a visualization method is applied to each subset, and the display result is an array of panels, one per subset. This is a powerful framework for visualization of data, both small and large. In D&R, the data are broken up into subsets, and any analytic method from statistics and machine learning is applied to each subset independently. Then the outputs are recombined. This provides not only a powerful framework for analysis, but also feasible and practical computations using distributed computational facilities. It enables deep analysis of the data: study of both data summaries as well as the detailed data at their finest granularity. This is critical to full understanding of the data. It also enables the analyst to program using an interactive high-level language for data analysis such as R, which allows the analyst to focus more on the data and less on code. In this paper we introduce Trelliscope, a system that scales Trellis to large complex data. It provides a way to create displays with a very large number of panels and an interactive viewer that allows the analyst to sort, filter, and sample the panels in a meaningful way. We discuss the underlying principles, design, and scalable architecture of Trelliscope, and illustrate its use on three analysis projects in proteomics, high intensity physics, and power systems engineering.
    2013 IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV); 01/2013
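
    The sort/filter idea in this abstract can be mimicked in plain R by attaching a per-panel metric (a "cognostic") to each subset and viewing panels in metric order. The sketch below uses base R and lattice only; it does not show Trelliscope's actual API or viewer, and the correlation cognostic and 0.5 cutoff are illustrative choices.

        # One panel per subset, plus a cognostic to sort and filter panels.
        library(lattice)

        subsets <- split(iris, iris$Species)

        # cognostic: within-subset correlation, used to rank panels
        cogs <- sapply(subsets, function(d) cor(d$Sepal.Length, d$Petal.Length))
        keep <- names(sort(cogs[cogs > 0.5], decreasing = TRUE))  # filter + sort

        for (nm in keep)                       # view panels in cognostic order
          print(xyplot(Petal.Length ~ Sepal.Length,
                       data = subsets[[nm]], main = nm))
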
    • "Progress has been slow on the idea because study of packet dynamics for network security requires detailed, comprehensive statistical analysis of large, complex packet-level databases; just passing over the data and creating statistical summaries does not shed sufficient light on the dynamics. Recently, though there have been major advances in computational environments for analysis of large, complex datasets, and our work has made use of them [7]. It might be thought that aggregate statistics would scale much more readily because the packet dynamics require more detail, at least formally. "
    ABSTRACT: A streaming statistical algorithm detects SSH client keystroke packets in a TCP connection on any port. Input data are time-stamps and TCP-IP header fields of packets in both directions, measured at a monitor on the path between the hosts. No packet content is included. The algorithm uses the packet dynamics just preceding and following a client packet with data to classify the packet as a keystroke or non-keystroke. The dynamics are described by classification variables derived from the arrival time-stamps and the packet data sizes, sequence numbers, acknowledgement numbers, and flags. The algorithm succeeds because a keystroke creates an identifiable dynamical pattern. Final testing of the algorithm is based on analysis of about 1 million connections covering all common network protocols. Data visualization and the statistical design of experiments play a critical role in the analysis. It is common to treat the choice of tuning parameters of a statistical or machine learning algorithm as an optimization that finds one set of parameter values. Instead, we run a designed experiment that treats the tuning parameters as statistical tuning factors, which yields valuable information about algorithm performance. One application of the algorithm is identification of any TCP connection as an SSH interactive session, allowing detection of backdoor SSH servers. More generally, the algorithm demonstrates the potential for the use of detailed packet dynamics to classify connections, important for network security. The algorithm has been prototyped in the widely used Argus traffic audit software system.
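
    The abstract's point about treating tuning parameters as factors in a designed experiment, rather than optimizing to a single setting, can be sketched with a full factorial design in R. Everything below is hypothetical: the factor names, their levels, and run_detector() merely stand in for the real keystroke classifier and its measured error rate.

        # Full factorial design over hypothetical tuning factors.
        design <- expand.grid(gap_thresh  = c(0.2, 0.5, 1.0),  # seconds
                              min_packets = c(5, 10, 20),
                              window      = c(3, 5))

        run_detector <- function(gap_thresh, min_packets, window) {
          runif(1)  # placeholder: would run the classifier, return error rate
        }

        design$error <- mapply(run_detector, design$gap_thresh,
                               design$min_packets, design$window)

        # Study main effects of the tuning factors across the whole design
        # instead of reporting one "optimal" parameter set.
        summary(aov(error ~ factor(gap_thresh) + factor(min_packets) +
                      factor(window), data = design))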