Project

Co-design Center for Online Data Analysis and Reduction

Goal: What are the best data analysis and reduction algorithms for different application classes, in terms of speed, accuracy, and resource requirements? How can we implement those algorithms to achieve scalability and performance portability?

What are the tradeoffs in data analysis accuracy, resource needs, and overall application performance when using various data reduction methods to reduce file size prior to offline data reconstruction and analysis, versus performing more online data analysis? How do these tradeoffs vary with exascale hardware and software choices?

How do we effectively orchestrate online data analysis and reduction to reduce associated overheads? How can exascale hardware and software help with orchestration?


Project log

Dingwen Tao
added 2 research items
A growing disparity between supercomputer computation speeds and I/O rates makes it increasingly infeasible for applications to save all results for offline analysis. Instead, applications must analyze and reduce data online so as to output only those results needed to answer target scientific question(s). This change in focus complicates application and experiment design and introduces algorithmic, implementation, and programming model challenges that are unfamiliar to many scientists and that have major implications for the design of various elements of supercomputer systems. We review these challenges and describe methods and tools that we are developing to enable experimental exploration of algorithmic, software, and system design alternatives.
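As a hedged illustration of the online analysis and reduction described above, the sketch below replaces per-step output of a full field with compact in-situ summaries. It assumes a NumPy-based toy simulation loop; the field shape, step count, and choice of summaries (mean, max, coarse histogram) are illustrative assumptions, not drawn from any particular CODAR application.

```python
import numpy as np

def simulate_step(step, shape=(512, 512)):
    """Stand-in for one simulation time step (illustrative only)."""
    rng = np.random.default_rng(step)
    return rng.normal(size=shape)

# Online reduction: accumulate compact summaries instead of saving every field.
summaries = []
for step in range(100):
    field = simulate_step(step)
    summaries.append({
        "step": step,
        "mean": float(field.mean()),
        "max": float(field.max()),
        "histogram": np.histogram(field, bins=32)[0],  # coarse distribution only
    })

# Only the summaries (a few KB per step) would be written to storage,
# rather than the raw 512x512 float64 field (about 2 MB per step).
```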
Because of the vast volume of data being produced by today's scientific simulations and experiments, lossy data compressors that allow user-controlled loss of accuracy during compression are a relevant solution for significantly reducing data size. However, lossy compressor developers and users lack a tool to explore the features of scientific datasets and to understand how the data are altered by compression in a systematic and reliable way. To address this gap, we have designed and implemented a generic framework called Z-checker. On the one hand, Z-checker combines a battery of data analysis components for data compression. On the other hand, Z-checker is implemented as an open-source community tool to which users and developers can contribute and add new analysis components based on their additional analysis demands. In this paper, we present a survey of existing lossy compressors. We then describe the design framework of Z-checker, in which we integrate evaluation metrics proposed in prior work as well as other analysis tools. Specifically, for lossy compressor developers, Z-checker can be used to characterize critical properties of any dataset in order to improve compression strategies. For lossy compression users, Z-checker can assess compression quality, providing various global distortion analyses that compare the original data with the decompressed data, as well as statistical analysis of the compression error. Z-checker can perform its analyses at either coarse or fine granularity, so that users and developers can select the best-fit, adaptive compressors for different parts of a dataset. Z-checker also features a visualization interface that displays all analysis results along with some basic views of the datasets, such as time series. To the best of our knowledge, Z-checker is the first tool designed to assess lossy compression of scientific datasets comprehensively.
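To make the distortion analysis concrete, the following is a minimal sketch of typical lossy-compression quality metrics (maximum pointwise error, MSE, value-range-based PSNR, Pearson correlation) computed between an original and a decompressed array. It is an independent NumPy illustration of the kind of metrics such a tool reports, not Z-checker's actual interface.

```python
import numpy as np

def distortion_metrics(original: np.ndarray, decompressed: np.ndarray) -> dict:
    """Common lossy-compression quality metrics (illustrative; not Z-checker's API)."""
    diff = original.astype(np.float64) - decompressed.astype(np.float64)
    value_range = float(original.max() - original.min())
    mse = float(np.mean(diff ** 2))
    psnr = 20 * np.log10(value_range) - 10 * np.log10(mse) if mse > 0 else float("inf")
    return {
        "max_abs_error": float(np.max(np.abs(diff))),
        "mse": mse,
        "psnr_db": float(psnr),
        "pearson_r": float(np.corrcoef(original.ravel(), decompressed.ravel())[0, 1]),
    }

# Example: compare a field against a copy perturbed by a small pseudo-compression error.
original = np.random.default_rng(0).normal(size=(256, 256))
decompressed = original + np.random.default_rng(1).normal(scale=1e-3, size=original.shape)
print(distortion_metrics(original, decompressed))
```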
Ian Foster
added 2 research items
High accuracy scientific simulations on high performance computing (HPC) platforms generate large amounts of data. To allow data to be efficiently analyzed, simulation outputs need to be refactored, compressed, and properly mapped onto storage tiers. This paper presents Canopus, a progressive data management framework for storing and analyzing big scientific data. Canopus allows simulation results to be refactored into a much smaller dataset along with a series of deltas with fairly low overhead. Then, the refactored data are compressed, mapped, and written onto storage tiers. For data analytics, refactored data are selectively retrieved to restore data at a specific level of accuracy that satisfies analysis requirements. Canopus enables end users to make trade-offs between analysis speed and accuracy on the fly. Canopus is demonstrated and thoroughly evaluated using blob detection on fusion simulation data.
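The base-plus-deltas idea behind such progressive refactoring can be shown with a toy one-dimensional decimation scheme: keep a coarse base level plus per-level correction deltas, and reconstruct only to the accuracy a given analysis needs. This is a simplified sketch under assumed parameters (three levels, nearest-neighbor prediction), not the Canopus implementation or its file layout.

```python
import numpy as np

def refactor(data: np.ndarray, levels: int = 3):
    """Split 1-D data into a coarse base plus per-level deltas (toy decimation scheme)."""
    base = data[:: 2 ** levels].copy()                 # coarsest level
    deltas = []
    for lvl in range(levels, 0, -1):
        coarse = data[:: 2 ** lvl]
        finer = data[:: 2 ** (lvl - 1)]
        approx = np.repeat(coarse, 2)[: finer.size]    # nearest-neighbor prediction
        deltas.append(finer - approx)                  # store only the correction
    return base, deltas

def restore(base: np.ndarray, deltas, up_to_level: int):
    """Reconstruct using the base plus the first `up_to_level` deltas."""
    current = base
    for delta in deltas[:up_to_level]:
        current = np.repeat(current, 2)[: delta.size] + delta
    return current

data = np.sin(np.linspace(0, 8 * np.pi, 1024))
base, deltas = refactor(data)
coarse_view = restore(base, deltas, up_to_level=1)     # fast, approximate analysis
full_view = restore(base, deltas, up_to_level=3)       # exact reconstruction
assert np.allclose(full_view, data)
```

Each delta could then be compressed and placed on a different storage tier, with analyses retrieving only the levels they need, which reflects the trade-off between analysis speed and accuracy that Canopus exposes.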
Ian Foster
added a research item
Ian Foster
added a project goal