Visualization Databases for the Analysis of Large Complex Datasets.

Journal of Machine Learning Research - Proceedings Track 01/2009; 5:193-200.
Source: DBLP

ABSTRACT Comprehensive visualization that preserves the information in a large complex dataset re- quires a visualization database (VDB): many displays, some with many pages, and with one or more panels per page. A single dis- play using a specific display method results from partitioning the data into subsets, sam- pling the subsets, and applying the method to each sample, typically one per panel. The time of the analyst to generate a display is not increased by choosing a large sample over a small one. Displays and display viewers can be designed to allow rapid scanning, and of- ten, it is not necessary to view every page of a display. VDBs, already successful just with off-the-shelf tools, can be greatly improved by a rethinking of all areas of data visual- ization in the context of a database of many large displays.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: A streaming statistical algorithm detects SSH client keystroke packets in a TCP connection on any port. Input data are time-stamps and TCP-IP header fields of packets in both directions, measured at a monitor on the path between the hosts. No packet content is included. The algorithm uses the packet dynamics just preceding and following a client packet with data to classify the packet as a keystroke or non-keystroke. The dynamics are described by classification variables derived from the arrival time-stamps and the packet data sizes, sequence numbers, acknowledgement numbers, and flags. The algorithm succeeds because a keystroke creates an identifiable dynamical pattern. Final testing of the algorithm is based on analysis of about 1 million connections covering all common network protocols. Data visualization and the statistical design of experiments play a critical role in the analysis. It is common to treat the choice of tuning parameters of a statistical or machine learning algorithm as an optimization that finds one set of parameter values. Instead, we run a designed experiment that treats the tuning parameters as statistical tuning factors, which yields valuable information about algorithm performance. One application of the algorithm is identification of any TCP connection as an SSH interactive session, allowing detection of backdoor SSH servers. More generally, the algorithm demonstrates the potential for the use of detailed packet dynamics to classify connections, important for network security. The algorithm is has been prototyped in the widely-used Argus traffic audit software system.
  • [Show abstract] [Hide abstract]
    ABSTRACT: D&R is a new statistical approach to the analysis of large complex data. The data are divided into subsets. Computationally, each subset is a small dataset. Analytic methods are applied to each of the subsets, and the outputs of each method are recombined to form a result for the entire data. Computations can be run in parallel with no communication among them, making them embarrassingly parallel, the simplest possible parallel processing. Using D&R, a data analyst can apply almost any statistical or visualization method to large complex data. Direct application of most analytic methods to the entire data is either infeasible, or impractical. D&R enables deep analysis: comprehensive analysis, including visualization of the detailed data, that minimizes the risk of losing important information. One of our D&R research thrusts uses statistics to develop “best” division and recombination procedures for analytic methods. Another is a D&R computational environment that has two widely used components, R and Hadoop, and our RHIPE merger of them. Hadoop is a distributed database and parallel compute engine that executes the embarrassingly parallel D&R computations across a cluster. RHIPE allows analysis wholly from within R, making programming with the data very efficient. Copyright © 2012 John Wiley & Sons, Ltd.
    Stat. 10/2012; 1(1).
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: As the need for large-scale data analysis is rapidly increasing, Hadoop, or the platform that realizes large-scale data processing, and MapReduce, or the internal computational model of Hadoop, are receiving great attention. This paper reviews the basic concepts of Hadoop and MapReduce necessary for data analysts who are familiar with statistical programming, through examples that combine the R programming language and Hadoop.
    Journal of the Korean Data and Information Science Society. 01/2013; 24(5).


Available from