Visualization Databases for the Analysis of Large Complex Datasets.

Journal of Machine Learning Research - Proceedings Track 01/2009; 5:193-200.
Source: DBLP

ABSTRACT: Comprehensive visualization that preserves the information in a large complex dataset requires a visualization database (VDB): many displays, some with many pages, and with one or more panels per page. A single display using a specific display method results from partitioning the data into subsets, sampling the subsets, and applying the method to each sample, typically one per panel. The time of the analyst to generate a display is not increased by choosing a large sample over a small one. Displays and display viewers can be designed to allow rapid scanning, and often, it is not necessary to view every page of a display. VDBs, already successful just with off-the-shelf tools, can be greatly improved by a rethinking of all areas of data visualization in the context of a database of many large displays.
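The partition, sample, and apply steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, and the stand-in "display method" (which returns a value range rather than drawing a plot) are all assumptions made for illustration, and "sampling the subsets" is read here as choosing a random selection of subsets.

```python
import random
from collections import defaultdict


def partition(records, key):
    """Split records into subsets keyed by key(record)."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[key(rec)].append(rec)
    return dict(subsets)


def sample_subsets(subsets, n, seed=0):
    """Choose n subset keys at random (all of them if n is larger)."""
    rng = random.Random(seed)
    keys = sorted(subsets)
    return sorted(rng.sample(keys, min(n, len(keys))))


def build_display(records, key, method, n_panels, panels_per_page):
    """Return a list of pages; each page is a list of (subset_key, panel)."""
    subsets = partition(records, key)
    chosen = sample_subsets(subsets, n_panels)
    # One panel per sampled subset, produced by the display method.
    panels = [(k, method(subsets[k])) for k in chosen]
    return [panels[i:i + panels_per_page]
            for i in range(0, len(panels), panels_per_page)]


def value_range(subset):
    """Hypothetical display method: the value range a panel would show."""
    values = [v for _, v in subset]
    return min(values), max(values)


# Hypothetical data: (group, value) pairs, 12 groups of 5 values each.
records = [(g, g * 10 + v) for g in range(12) for v in range(5)]
display = build_display(records, key=lambda r: r[0], method=value_range,
                        n_panels=8, panels_per_page=3)
```

Asking for more panels changes only the `n_panels` argument, which matches the abstract's point that a larger sample costs the analyst no extra effort to specify.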

  • ABSTRACT: As the need for large-scale data analysis rapidly increases, Hadoop, a platform for large-scale data processing, and MapReduce, the internal computational model of Hadoop, are receiving great attention. This paper reviews the basic concepts of Hadoop and MapReduce that data analysts familiar with statistical programming need, through examples that combine the R programming language and Hadoop.
    09/2013; 24(5). DOI:10.7465/jkdi.2013.24.5.1013
  • ABSTRACT: The growth in the use of computationally intensive statistical procedures, especially with Big Data, has necessitated the usage of parallel computation on diverse platforms such as multicore, GPU, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to "embarrassingly parallel" (EP) algorithms, especially on non-shared memory platforms. This paper develops a broadly applicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain problems of memory limitations.
  • ABSTRACT: D&R is a new statistical approach to the analysis of large complex data. The data are divided into subsets. Computationally, each subset is a small dataset. Analytic methods are applied to each of the subsets, and the outputs of each method are recombined to form a result for the entire data. Computations can be run in parallel with no communication among them, making them embarrassingly parallel, the simplest possible parallel processing. Using D&R, a data analyst can apply almost any statistical or visualization method to large complex data. Direct application of most analytic methods to the entire data is either infeasible or impractical. D&R enables deep analysis: comprehensive analysis, including visualization of the detailed data, that minimizes the risk of losing important information. One of our D&R research thrusts uses statistics to develop “best” division and recombination procedures for analytic methods. Another is a D&R computational environment that has two widely used components, R and Hadoop, and our RHIPE merger of them. Hadoop is a distributed database and parallel compute engine that executes the embarrassingly parallel D&R computations across a cluster. RHIPE allows analysis wholly from within R, making programming with the data very efficient. Copyright © 2012 John Wiley & Sons, Ltd.
    10/2012; 1(1). DOI:10.1002/sta4.7
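
The divide, apply, and recombine pattern described in the D&R abstract above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it does not use R, Hadoop, or RHIPE, the function names are hypothetical, and the size-weighted mean is a deliberately simple recombination chosen because it reproduces the all-data result exactly.

```python
from statistics import fmean


def divide(data, n_subsets):
    """Split data into n_subsets roughly equal pieces by striding."""
    return [data[i::n_subsets] for i in range(n_subsets)]


def analyze(subset):
    """Per-subset analytic method: return (mean, size) for one subset."""
    return fmean(subset), len(subset)


def recombine(outputs):
    """Size-weighted average of the per-subset means."""
    total = sum(n for _, n in outputs)
    return sum(m * n for m, n in outputs) / total


# Hypothetical data: the numbers 1 through 100.
data = [float(x) for x in range(1, 101)]

# Each analyze() call depends only on its own subset, so the calls could run
# in parallel with no communication among them.
outputs = [analyze(s) for s in divide(data, 4)]
estimate = recombine(outputs)
```

For most analytic methods the recombined result only approximates the all-data result, which is why the abstract describes research into "best" division and recombination procedures.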

