Article
Visualization Databases for the Analysis of Large Complex Datasets.
Journal of Machine Learning Research Proceedings Track 01/2009; 5:193–200.
Source: DBLP

Article: Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones
ABSTRACT: The growth in the use of computationally intensive statistical procedures, especially with Big Data, has necessitated the use of parallel computation on diverse platforms such as multicore, GPU, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to "embarrassingly parallel" (EP) algorithms, especially on non-shared-memory platforms. This paper develops a broadly applicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain problems of memory limitations. 09/2014;
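The conversion described above can be sketched in a few lines of Python: split the data into independent chunks, run the estimator on each chunk with no communication between them, and average the chunk estimates. This is a minimal illustration under assumed names (`ols_coefs`, `software_alchemy`) with OLS as the example estimator, not the paper's actual implementation:

```python
import numpy as np

def ols_coefs(X, y):
    # Ordinary least squares via the normal equations.
    return np.linalg.solve(X.T @ X, X.T @ y)

def software_alchemy(X, y, n_chunks, estimator):
    # Split rows into independent chunks; each chunk's computation
    # needs no communication with the others (embarrassingly parallel),
    # so the loop below could run on separate cores or cluster nodes.
    Xs = np.array_split(X, n_chunks)
    ys = np.array_split(y, n_chunks)
    chunk_estimates = [estimator(Xc, yc) for Xc, yc in zip(Xs, ys)]
    # Recombine by averaging; for i.i.d. data this is asymptotically
    # statistically equivalent to the full-data estimate.
    return np.mean(chunk_estimates, axis=0)
```

Because each chunk also holds only a fraction of the rows, the per-process working set shrinks, which is the memory-limitation benefit the abstract mentions.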
ABSTRACT: D&R is a new statistical approach to the analysis of large complex data. The data are divided into subsets. Computationally, each subset is a small dataset. Analytic methods are applied to each of the subsets, and the outputs of each method are recombined to form a result for the entire data. Computations can be run in parallel with no communication among them, making them embarrassingly parallel, the simplest possible parallel processing. Using D&R, a data analyst can apply almost any statistical or visualization method to large complex data. Direct application of most analytic methods to the entire data is either infeasible or impractical. D&R enables deep analysis: comprehensive analysis, including visualization of the detailed data, that minimizes the risk of losing important information. One of our D&R research thrusts uses statistics to develop “best” division and recombination procedures for analytic methods. Another is a D&R computational environment that has two widely used components, R and Hadoop, and our RHIPE merger of them. Hadoop is a distributed database and parallel compute engine that executes the embarrassingly parallel D&R computations across a cluster. RHIPE allows analysis wholly from within R, making programming with the data very efficient. Copyright © 2012 John Wiley & Sons, Ltd. Stat. 10/2012; 1(1).
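The divide/apply/recombine cycle can be made concrete with a small sketch. Here each subset's analytic output is a set of sufficient statistics, so the recombination of the mean and variance is exact rather than approximate; the function names (`divide_and_recombine`, `analytic_method`, `recombine`) are assumptions for this illustration, not RHIPE's API:

```python
import numpy as np

def analytic_method(subset):
    # Applied independently to each subset (embarrassingly parallel):
    # emit sufficient statistics (count, sum, sum of squares).
    return len(subset), subset.sum(), (subset ** 2).sum()

def recombine(outputs):
    # Pool the subset statistics into an exact full-data mean/variance.
    n = sum(o[0] for o in outputs)
    s = sum(o[1] for o in outputs)
    ss = sum(o[2] for o in outputs)
    mean = s / n
    var = ss / n - mean ** 2   # population variance
    return mean, var

def divide_and_recombine(data, n_subsets):
    subsets = np.array_split(data, n_subsets)          # divide
    outputs = [analytic_method(s) for s in subsets]    # apply, no communication
    return recombine(outputs)                          # recombine
```

In a real D&R environment the apply step would be dispatched across a Hadoop cluster; the point here is only that each subset computation is independent and the recombination operates on small per-subset outputs.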
Article: Hadoop and MapReduce
ABSTRACT: As the need for large-scale data analysis is rapidly increasing, Hadoop, the platform that realizes large-scale data processing, and MapReduce, the internal computational model of Hadoop, are receiving great attention. This paper reviews the basic concepts of Hadoop and MapReduce necessary for data analysts who are familiar with statistical programming, through examples that combine the R programming language and Hadoop. Journal of the Korean Data and Information Science Society. 09/2013; 24(5).
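The MapReduce model the abstract refers to has three conceptual phases: mappers emit key-value pairs, a shuffle groups pairs by key, and reducers combine each group. A minimal single-process sketch (in Python rather than R, with word counting as the standard example; this only simulates the model, it does not use Hadoop):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Mapper: emit a (word, 1) pair for every word in one document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: combine each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

def mapreduce(docs):
    pairs = chain.from_iterable(map(map_phase, docs))
    return reduce_phase(shuffle(pairs))
```

On a real cluster the map and reduce calls run on different nodes over distributed file blocks; only the shuffle involves communication, which is why MapReduce fits the embarrassingly parallel computations discussed in the entries above.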