AMRIC: A Novel In Situ Lossy Compression Framework for
Efficient I/O in Adaptive Mesh Refinement Applications
Daoce Wang
Indiana University
Bloomington, IN, USA
daocwang@iu.edu
Jesus Pulido
Los Alamos National Lab
Los Alamos, NM, USA
pulido@lanl.gov
Pascal Grosset
Los Alamos National Lab
Los Alamos, NM, USA
pascalgrosset@lanl.gov
Jiannan Tian
Indiana University
Bloomington, IN, USA
jti1@iu.edu
Sian Jin
Indiana University
Bloomington, IN, USA
sianjin@iu.edu
Houjun Tang
Lawrence Berkeley
National Lab
Berkeley, CA, USA
htang4@lbl.gov
Jean Sexton
Lawrence Berkeley
National Lab
Berkeley, CA, USA
jmsexton@lbl.gov
Sheng Di
Argonne National Lab
Lemont, IL, USA
sdi1@anl.gov
Zarija Lukić
Lawrence Berkeley
National Lab
Berkeley, CA, USA
zarija@lbl.gov
Kai Zhao
Florida State University
Tallahassee, FL, USA
kzhao@cs.fsu.edu
Bo Fang
Pacific Northwest National Lab
Richland, WA, USA
bo.fang@pnnl.gov
Franck Cappello
Argonne National Lab
Lemont, IL, USA
cappello@mcs.anl.gov
James Ahrens
Los Alamos National Lab
Los Alamos, NM, USA
ahrens@lanl.gov
Dingwen Tao∗
Indiana University
Bloomington, IN, USA
ditao@iu.edu
ABSTRACT
As supercomputers advance towards exascale capabilities, computational intensity increases significantly, and the volume of data requiring storage and transmission experiences exponential growth. Adaptive Mesh Refinement (AMR) has emerged as an effective solution to address these two challenges. Concurrently, error-bounded lossy compression is recognized as one of the most efficient approaches to tackle the latter issue. Despite their respective advantages, few attempts have been made to investigate how AMR and error-bounded lossy compression can function together. To this end, this study presents a novel in-situ lossy compression framework that employs the HDF5 filter to reduce I/O costs and boost compression quality for AMR applications. We implement our solution into the AMReX framework and evaluate it on two real-world AMR applications, Nyx and WarpX, on the Summit supercomputer. Experiments with 4096 CPU cores demonstrate that AMRIC improves the compression ratio by up to 81× and the I/O performance by up to 39× over AMReX's original compression solution.
CCS CONCEPTS
• Theory of computation → Data compression; • Computing methodologies → Massively parallel and high-performance simulations.
KEYWORDS
Lossy compression, AMR, I/O, performance.
∗ Corresponding author: Dingwen Tao, Department of Intelligent Systems Engineering, Luddy School of Computing, Informatics, and Engineering, Indiana University.
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
SC ’23, Nov 12–17, 2023, Denver, CO
©2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8442-1/21/11.
https://doi.org/XX.XXXX/XXXXXXX.XXXXXXX
ACM Reference Format:
Daoce Wang, Jesus Pulido, Pascal Grosset, Jiannan Tian, Sian Jin, Houjun Tang, Jean Sexton, Sheng Di, Zarija Lukić, Kai Zhao, Bo Fang, Franck Cappello, James Ahrens, and Dingwen Tao. 2023. AMRIC: A Novel In Situ Lossy Compression Framework for Efficient I/O in Adaptive Mesh Refinement Applications. In Proceedings of The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '23). ACM, New York, NY, USA, 12 pages. https://doi.org/XX.XXXX/XXXXXXX.XXXXXXX
1 INTRODUCTION
In recent years, scientific simulations have experienced a dramatic increase in both scale and expense. To address this issue, many high-performance computing (HPC) simulation packages, like AMReX [41] and Athena++ [31], have employed Adaptive Mesh Refinement (AMR) as a technique to decrease computational costs while maintaining or even improving the accuracy of simulation results. Unlike traditional uniform mesh methods that apply a consistent resolution throughout the entire simulation domain, AMR offers a more efficient approach by dynamically adjusting the resolution and focusing higher resolution on crucial areas, thereby conserving computational resources and storage.
Although AMR can reduce output data size, the reduction may not be substantial enough for scientific simulations, resulting in high I/O and storage costs. For example, an AMR simulation with a resolution of 2048³ (i.e., 0.5×1024³ mesh points in the coarse level and 0.5×2048³ in the fine level) can generate up to 1 TB of data for a single snapshot with all data fields dumped; a total of 1 PB of disk storage is needed, assuming the simulation is run in an ensemble of five runs with 200 snapshots dumped per simulation. To this end, data compression approaches could be utilized in conjunction with AMR techniques to further save I/O and storage costs. However, traditional lossless compression methods have limited effectiveness in reducing the massive amounts of data generated by scientific simulations, typically achieving a compression ratio of only up to 2×. As a solution, a new generation of error-bounded lossy compression techniques, such as SZ [10, 20, 32], ZFP [22], MGARD [2], and their GPU versions [8, 36, 37], have been widely used in the scientific community [5, 7, 10, 14, 17, 19, 20, 22, 24, 25, 32, 34].
While lossy compression holds the potential to significantly reduce I/O and storage costs associated with AMR simulations, there has been limited research on using lossy compression in AMR simulations. Two recent studies have aimed to devise efficient lossy compression methods for AMR datasets. Luo et al. [26] proposed zMesh, which reorders AMR data across different refinement levels into a 1D array to leverage data redundancy. However, by compressing data in a 1D array, zMesh is unable to exploit higher-dimensional compression, leading to a loss of topology information and data locality in higher-dimensional data. In contrast, Wang et al. [39] developed TAC to enhance zMesh's compression quality through adaptive 3D compression. While zMesh and TAC offer offline compression solutions for AMR data, they are not suitable for in situ compression of AMR data. We will discuss these works and their limitations for in situ compression in detail in §5.
On the other hand, in situ compression of AMR data can enhance I/O efficiency by compressing data during the application's runtime, allowing smaller, compressed data to be written directly to storage systems. This approach eliminates the need to transfer large amounts of original data between compute nodes and storage systems, further streamlining the process. AMReX currently supports in situ compression for AMR data [3]; however, the current implementation converts the high-dimensional data into a 1D array before compression, which limits the compression performance by discarding spatial information. Additionally, it utilizes a small HDF5 chunk size, leading to lower compression ratios and reduced I/O performance. These limitations will be discussed in more detail in Sections 2.1 and 3.3.
To address these issues, we propose an effective in situ lossy compression framework for AMR simulations, called AMRIC, that enhances I/O performance and compression quality. Unlike AMReX's naïve in situ compression approach, AMRIC performs 3D compression through a customized pre-processing workflow and thus leverages its high compression ratio. Additionally, we incorporate the HDF5 compression filter to further improve I/O performance and usability. Our primary contributions are outlined below:
• We propose a first-of-its-kind 3D in situ AMR data compression framework through HDF5 (called AMRIC¹).
• We design a compression-oriented pre-processing workflow for AMR data, which involves removing redundant data, uniformly truncating the remaining data into 3D blocks, and reorganizing the blocks based on different compressors.
• We employ the state-of-the-art lossy compressor SZ (with two different algorithms/variants) and further optimize it to improve the compression quality for AMR data. This involves utilizing Shared Lossless Encoding (SLE) and an adaptive block size in the SZ compressor to enhance prediction quality and hence improve the compression quality.
• To efficiently utilize the HDF5 compression filter on AMR data, we modify the data layout and the original compression filter to adopt a larger HDF5 chunk size without introducing extra storage overhead. This enables higher compression throughput, improving both I/O and compression quality.
• We integrate AMRIC into the AMReX framework and evaluate it on two real-world AMReX applications, WarpX and Nyx, using the Summit supercomputer.
• Experimental results demonstrate that AMRIC can significantly outperform the non-compression solution and AMReX's original compression solution in terms of I/O performance and compression quality.
¹The code is available at https://github.com/SC23-AMRIC/SC23-AMRIC.
The remainder of this paper is organized as follows. In §2, we provide an overview of error-bounded lossy compression for scientific data, the HDF5 file format, AMR approaches and data structures, as well as a review of related work in the field of AMR data compression. In §3, we describe our proposed 3D in situ compression strategies in detail. In §4, we present the experimental results of AMRIC and comparisons with existing approaches. In §5, we discuss related work and their limitations. In §6, we conclude this work and outline potential future research.
2 BACKGROUND AND MOTIVATION
In this section, we introduce background information on the HDF5 format and its filter mechanism, lossy compression for scientific data, and AMR methods and AMR data.
2.1 HDF5 Format and HDF5 Filter
An essential technique for minimizing the I/O time associated with the vast data generated by large-scale HPC applications is parallel I/O. Numerous parallel I/O libraries exist, including HDF5 [13] and NetCDF [30]. In this study, we focus on HDF5, as it is widely embraced by the HPC community and by hydrodynamic simulations such as Nyx [27], a simulation that models astrophysical reacting flows on HPC systems, and VPIC [6], a large-scale plasma physics simulation. Moreover, HDF5 natively supports data compression filters [1] such as H5Z-SZ [9] and H5Z-ZFP [23]. HDF5 allows chunked data to pass through user-defined compression filters on the way to or from the storage system [35]. This means that the data can be compressed/decompressed using a compression filter during write/read operations.
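For concreteness, the sketch below shows how a chunked HDF5 dataset can be created with a compression filter attached, so that each chunk passes through the filter during H5Dwrite. The 64³ chunk shape and the filter ID 32017 (commonly associated with H5Z-SZ) are illustrative assumptions rather than AMRIC's actual configuration, and the corresponding filter plugin must be available at runtime.

```cpp
// Minimal sketch (not AMRIC's implementation): create a chunked HDF5 dataset
// with a third-party compression filter so each chunk is compressed on its way
// to storage. The filter ID 32017 (often used for H5Z-SZ) and the 64^3 chunk
// shape are assumptions; the filter plugin must be loadable at runtime.
#include <hdf5.h>
#include <vector>

int main() {
    const hsize_t dims[3]  = {256, 256, 256};   // full dataset extent
    const hsize_t chunk[3] = {64, 64, 64};      // granularity at which the filter runs

    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(3, dims, nullptr);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);               // chunking is required for any I/O filter

    const unsigned cd_values[1] = {0};          // filter-specific parameters (placeholder)
    H5Pset_filter(dcpl, 32017, H5Z_FLAG_MANDATORY, 1, cd_values);

    hid_t dset = H5Dcreate2(file, "field", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    std::vector<double> data(256UL * 256 * 256, 1.0);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
    return 0;
}
```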
Selecting the optimal chunk size when using compression filters in parallel scenarios is often challenging, since dataset chunking is required to enable I/O filters. On the one hand, choosing a chunk size that is too small may lead to an excessive number of data blocks, resulting in a lower compression ratio caused by reduced encoding efficiency and data locality. Additionally, smaller chunks can hamper I/O performance as a consequence of the start-up costs associated with HDF5 and the compressor. On the other hand, larger chunks may enhance writing efficiency but can also create overhead if the chunk size exceeds the data size on certain processes. This occurs because the chunk size must remain consistent across the entire dataset for each process. Striking a balance between these two factors is essential for achieving optimal I/O performance and compression efficiency in parallel environments. Moreover, a larger chunk size can result in the compression of distinct data fields (e.g., density and velocity) together. To tackle these challenges, we propose an optimized approach that modifies the AMR data layout and adaptively determines the maximum chunk size without introducing size overhead issues. We will detail these methods in §3.3.
2.2 Lossy Compression for Scientific Data
Lossy compression is a common data reduction method that can achieve high compression ratios by sacrificing some non-critical information in the reconstructed data. Compared to lossless compression, lossy compression often provides much higher compression ratios, especially for continuous floating-point data. The performance of lossy compression is typically measured by three key metrics: compression ratio, data distortion, and compression throughput. The compression ratio is the ratio between the original data size and the compressed data size. Data distortion measures the quality of the reconstructed data compared to the original data using metrics such as the peak signal-to-noise ratio (PSNR). Compression throughput represents the amount of data that the compressor can process within a certain time.
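To make these metrics concrete, the following hypothetical helper (not part of AMRIC) computes the compression ratio and the value-range-based PSNR definition used throughout this paper.

```cpp
// Hypothetical helpers: compute compression ratio and PSNR for a reconstructed
// field, using the value-range-based PSNR definition
// 20*log10(R) - 10*log10(sum(e_i^2)/N) adopted in this paper.
#include <algorithm>
#include <cmath>
#include <cstddef>

double compression_ratio(std::size_t original_bytes, std::size_t compressed_bytes) {
    return static_cast<double>(original_bytes) / static_cast<double>(compressed_bytes);
}

double psnr(const double* orig, const double* recon, std::size_t n) {
    double vmin = orig[0], vmax = orig[0], sq_err = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        vmin = std::min(vmin, orig[i]);
        vmax = std::max(vmax, orig[i]);
        const double e = orig[i] - recon[i];
        sq_err += e * e;
    }
    const double range = vmax - vmin;  // R: value range of the dataset
    return 20.0 * std::log10(range) - 10.0 * std::log10(sq_err / n);
}
```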
In recent years, several high-accuracy lossy compressors for scientific floating-point data have been proposed and developed, such as SZ [10, 20, 32] and ZFP [22]. SZ is a prediction-based lossy compressor, whereas ZFP is a transform-based lossy compressor. Both SZ and ZFP are specifically designed to compress scientific floating-point data and provide a precise error-controlling scheme based on user requirements. For example, the error-bounded mode requires users to set a type of error bound, such as the absolute error bound, and a bound value. The compressor then ensures that the differences between the original and reconstructed data do not exceed the error bound.
In this work, we adopt the SZ lossy compressor in our framework due to its high compression ratio and its modular design, which facilitates integration with our framework, AMRIC. Additionally, the SZ framework includes various algorithms to satisfy different user needs. For example, SZ with the Lorenzo predictor [33] provides high compression throughput, while SZ with spline interpolation [21] provides a high compression ratio, particularly for large error bounds. Generally, there are three main steps in prediction-based lossy compression such as SZ. The first step is to predict each data point's value based on its neighboring points using a best-fit prediction method. The second step is to quantize the difference between the real value and the predicted value based on the user-set error bound. Finally, customized Huffman coding and lossless compression are applied to achieve high compression ratios.
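The following stripped-down sketch illustrates these three steps on a 1D signal with a first-order Lorenzo predictor and linear-scaling quantization; the real SZ compressor uses multi-dimensional predictors, adaptive predictor selection, and a customized Huffman coder, so this is only a conceptual illustration.

```cpp
// Illustrative sketch of prediction-based error-bounded compression on a 1D
// signal: predict each point from its previous reconstructed neighbor
// (1st-order Lorenzo), quantize the prediction error against the error bound,
// and hand the integer codes to an entropy coder.
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<int32_t> quantize_1d(const std::vector<double>& data, double eb,
                                 std::vector<double>& recon) {
    std::vector<int32_t> codes(data.size());
    recon.resize(data.size());
    double prev = 0.0;  // predictor state (reconstructed, so the decoder can mirror it)
    for (std::size_t i = 0; i < data.size(); ++i) {
        const double pred = prev;                              // Lorenzo prediction
        const double diff = data[i] - pred;
        const int32_t q = static_cast<int32_t>(std::round(diff / (2.0 * eb)));
        codes[i] = q;
        recon[i] = pred + 2.0 * eb * q;                        // |data[i] - recon[i]| <= eb
        prev = recon[i];
    }
    // The codes would then be Huffman-encoded and losslessly compressed.
    return codes;
}
```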
Several prior works have studied the impact of lossy compression on reconstructed data quality and post-hoc analysis, providing guidelines on how to set the compression configurations for certain applications [11, 17–21, 33]. For instance, a comprehensive framework was established to dynamically adjust the best-fit compression configuration for different data partitions of a simulation based on its data characteristics [18]. However, no prior study has established an in situ lossy compression method for AMR simulations by efficiently leveraging these existing lossy compressors. Therefore, this paper proposes an efficient in situ data compression framework for AMR simulations to effectively utilize the high compression ratios offered by lossy compression.
2.3 AMR Methods and AMR Data
AMR is a technique that tailors the accuracy of a solution by employing a non-uniform grid, which allows for computational and storage savings without compromising the desired accuracy. In AMR applications, the mesh or spatial resolution is adjusted based on the level of refinement required for the simulation. This involves using a finer mesh in regions of greater importance or interest and a coarser mesh in areas of lesser significance. Throughout an AMR simulation, meshes are refined according to specific refinement criteria, such as refining a mesh block when its maximum value surpasses a predetermined threshold (e.g., the average value of the entire field), as illustrated in Figure 1.
Figure 1: Visualization of an up-close 2D slice of three pivotal timesteps generated by an AMR-based cosmology simulation, Nyx. As the universe evolves, the grid structure adapts accordingly. The dashed black and red boxes highlight areas of finer and finest refinement, respectively.
Figure 1 shows that during an AMR run, the mesh is refined when the data meet the refinement criteria, e.g., refining a block when the norm of its gradients or its maximum value is larger than a threshold. By dynamically adapting the mesh resolution in response to the simulation's requirements, AMR effectively balances computational efficiency and solution accuracy, making it a powerful approach for various scientific simulations.
Data generated by an AMR application is inherently hierarchical, with each AMR level featuring a different resolution. Typically, the data for each level is stored separately, such as in distinct HDF5 datasets (groups). For example, Figure 3 (left) presents a simple two-level patch-based AMR dataset in an HDF5 structure, where "0" indicates the coarse level (low resolution), and "1" signifies the fine level (high resolution). When users need the AMR data for post-analysis, they usually convert the data from different levels into a uniform resolution. In the given example, the data at the coarse level would be up-sampled and combined with the data at the fine level, excluding the redundant coarse data point "0D", as illustrated in Figure 3 (right). This method can also serve to visualize AMR data without the need for specific AMR visualization toolkits.
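The sketch below illustrates this conversion for a hypothetical two-level 3D dataset with a refinement ratio of 2 and a flat array layout (both assumptions for illustration): the coarse level is nearest-neighbor up-sampled to the fine resolution, and fine-level values replace the up-sampled coarse values wherever the region is refined.

```cpp
// Illustrative sketch (assumed layouts): convert a two-level AMR field to a
// uniform fine-resolution grid by nearest-neighbor up-sampling the coarse level
// and overwriting covered cells with fine-level data.
#include <cstddef>
#include <vector>

// coarse: nc^3 values; fine/fine_mask: nf^3 values (nf = 2*nc), where
// fine_mask[i] != 0 marks cells actually refined at the fine level.
std::vector<double> to_uniform(const std::vector<double>& coarse, std::size_t nc,
                               const std::vector<double>& fine,
                               const std::vector<char>& fine_mask) {
    const std::size_t nf = 2 * nc;
    std::vector<double> out(nf * nf * nf);
    for (std::size_t k = 0; k < nf; ++k)
        for (std::size_t j = 0; j < nf; ++j)
            for (std::size_t i = 0; i < nf; ++i) {
                const std::size_t f = (k * nf + j) * nf + i;            // fine index
                const std::size_t c = ((k / 2) * nc + j / 2) * nc + i / 2;  // coarse index
                out[f] = fine_mask[f] ? fine[f] : coarse[c];            // fine data wins
            }
    return out;
}
```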
There are two primary techniques for representing AMR data: patch-based AMR and tree-based AMR [40]. The key distinction between these approaches lies in how they handle data redundancy across different levels of refinement. Patch-based AMR maintains redundant data in the coarse level, as it stores data blocks to be refined at the next level within the current level, simplifying the computation involved in the refinement process. Conversely, tree-based AMR organizes grids on tree leaves, eliminating redundant data across levels. However, tree-based AMR data can be more complex for post-analysis and visualization when compared to patch-based AMR data [15].
Figure 2: Overview of our proposed AMRIC.
Figure 3: A typical example of AMR data storage and usage.
In this work, we focus on the state-of-the-art patch-based AMR framework, AMReX, which supports the HDF5 format and compression filter [3]. However, AMReX currently only supports 1D compression, which restricts its ability to leverage higher-dimensional compression. Furthermore, the original compression in AMReX cannot effectively utilize the HDF5 filter, resulting in low compression quality and I/O performance (as will be described in detail in §3.3 and §5). This limitation motivates our proposal of a 3D in situ compression method for AMReX, aimed at enhancing compression quality and I/O performance.
It is worth noting that the redundant coarser-level data in AMReX (patch-based AMR) is often not utilized during post-analysis and visualization, as demonstrated in Figure 3 (the coarse point "0D" will not be used). Therefore, we discard the redundant data during compression to improve the compression ratio.
3 DESIGN METHODOLOGY
In this section, we introduce our proposed in situ 3D AMR compression framework, AMRIC, which uses the HDF5 filter, as shown in Figure 2, with an outline detailed below.
In §3.1, we first propose a pre-processing approach for AMR data, which includes eliminating data redundancy, uniformly truncating the data, and reorganizing the truncated data blocks according to the requirements of different compressors, including different SZ compression algorithms. In §3.2, we further optimize the SZ compressor's efficiency for compressing AMR data by employing Shared Lossless Encoding (SLE) and dynamically determining the ideal block sizes for the SZ compressor, taking into account the specific characteristics of AMR data. In §3.3, we present strategies to overcome the obstacles between HDF5 and AMR applications by modifying the AMR data layout as well as the HDF5 compression filter mechanism, which results in a significant improvement in both compression ratio and I/O performance, as discussed in §2.1.
3.1 Pre-processing of AMR Data
As mentioned in §2.3, in the patch-based AMR dataset generated by AMReX, the redundant data in the coarse level can be removed to improve the compression ratio and I/O performance, because there is less data to be processed. Patch-based AMR divides each AMR level's domain into a set of rectangular boxes. Figure 4 (right) illustrates an example of an AMR dataset with three levels in total. In the AMReX numbering convention, the coarsest level is designated as level 0. There are 4, 3, and 3 boxes on levels 0, 1, and 2, respectively. Bold lines signify box boundaries. The four coarsest boxes (black) cover the whole domain. There are three intermediate-resolution boxes (blue) at level 1 with cells that are two times finer than those at level 0. The three finest boxes (red) at level 2 have cells that are twice as fine as those at level 1.
Clearly, there are overlapping areas between the different AMR levels: the coarsest level 0 overlaps with the finer level 1, and level 1 also has overlapping regions with the finest level 2. Taking level 1 as an example, we can eliminate the redundant coarse regions that overlap with level 2. It is worth noting that AMReX offers efficient functions for box intersection, which can be employed to identify these overlapping areas. These functions are significantly faster than a naive implementation, resulting in reduced time costs [4]. Furthermore, there is no need to record the position of the empty regions in the compressed data, as the position of the empty regions in level 1 can be inferred from the box positions of level 2, which introduces minimal overhead to the compressed data size.
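For illustration only, the sketch below shows the basic axis-aligned box intersection underlying this step, using plain index-range structs rather than AMReX's optimized box routines; the coarse-level data inside each non-empty intersection is the redundant region that can be dropped.

```cpp
// Plain illustrative version of the overlap test (AMReX provides much faster
// built-in routines). Boxes are axis-aligned index ranges [lo, hi] expressed
// in the same (refined) index space.
#include <algorithm>

struct Box { int lo[3], hi[3]; };  // inclusive index bounds

// Returns whether boxes a and b intersect; if so, `out` holds the overlapping
// region, whose coarse-level data is redundant and can be discarded.
bool intersect(const Box& a, const Box& b, Box& out) {
    for (int d = 0; d < 3; ++d) {
        out.lo[d] = std::max(a.lo[d], b.lo[d]);
        out.hi[d] = std::min(a.hi[d], b.hi[d]);
        if (out.lo[d] > out.hi[d]) return false;  // empty overlap in this dimension
    }
    return true;
}
```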
The challenge in compressing 3D data is that the boxes have varying shapes, particularly after redundancy removal, as shown in Figure 4, and especially in larger datasets. To tackle this irregularity, we propose a uniform truncation method that partitions the data into a collection of unit blocks. This approach facilitates the collective compression of boxes, irrespective of their varying and irregular shapes. This method not only boosts encoding efficiency but also reduces the compressor's launch time by eliminating the need to call the compressor separately for each unique box shape.
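A minimal sketch of this truncation is shown below, assuming a flat x-fastest array layout for each box; boundary blocks smaller than the unit size are zero-padded here purely for simplicity.

```cpp
// Illustrative sketch: partition one (possibly irregular) box into fixed-size
// unit blocks so that boxes of different shapes can be compressed collectively.
#include <cstddef>
#include <vector>

using Block = std::vector<double>;  // unit*unit*unit values

std::vector<Block> truncate_box(const std::vector<double>& box,
                                std::size_t nx, std::size_t ny, std::size_t nz,
                                std::size_t unit) {
    std::vector<Block> blocks;
    for (std::size_t bz = 0; bz < nz; bz += unit)
        for (std::size_t by = 0; by < ny; by += unit)
            for (std::size_t bx = 0; bx < nx; bx += unit) {
                Block blk(unit * unit * unit, 0.0);  // zero-padded at boundaries
                for (std::size_t k = 0; k < unit && bz + k < nz; ++k)
                    for (std::size_t j = 0; j < unit && by + j < ny; ++j)
                        for (std::size_t i = 0; i < unit && bx + i < nx; ++i)
                            blk[(k * unit + j) * unit + i] =
                                box[((bz + k) * ny + (by + j)) * nx + (bx + i)];
                blocks.push_back(std::move(blk));
            }
    return blocks;
}
```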
Subsequently, the unit blocks generated by truncation can be rearranged based on the needs of a specific compressor to improve compression performance. In this work, we focus on the SZ2 compression algorithm with the Lorenzo and linear regression predictors (denoted by "SZ_L/R") [20], and the SZ compression algorithm with the spline interpolation approach (denoted by "SZ_Interp") [42]. Specifically, SZ_L/R first truncates the whole input data into blocks of size 6×6×6 and then applies the Lorenzo predictor or high-dimensional linear regression to each block separately. For SZ_L/R, we linearize the truncated unit blocks as shown in the top right of Figure 4 (placing the unit blocks along the z-axis for 3D data), because this requires the fewest operations during reorganization, thus saving time.
Figure 4: An example of our proposed 3D pre-processing workflow (in a top-down 2D view).
Figure 5: Rate-distortion comparison between linear and cluster arrangements across different levels for Nyx's "baryon density" field: (a) fine level (density = 17.4%); (b) coarse level (density = 82.3%). The considered relative error bounds range from 2×10⁻² to 3×10⁻⁴.
The SZ_Interp will perform interpolation across all three dimensions of the entire dataset. Given that interpolation is a global operation, one potential solution to improve interpolation accuracy is to cluster the truncated unit blocks more closely into a cube-like formation, as depicted in the bottom-right part of Figure 4. This configuration helps balance the interpolation process across multiple dimensions, thus significantly improving the compression performance. As depicted in Figure 5, organizing unit blocks in a more compact cluster arrangement leads to enhanced overall compression performance in terms of rate-distortion (PSNR² versus compression ratio) when compared to a linear arrangement of the blocks. This improvement is particularly noticeable when the compression ratio is relatively high. The test data is generated from a Nyx run featuring two refinement levels: a coarse level with 256³ grids and a fine level containing 512³ grids. After eliminating redundant coarse data, the coarse level has a data density of 82.3%, while the fine level has a data density of 17.4%. Here, data density refers to the proportion of data saturation within the entire domain.
² PSNR is calculated as $20\cdot\log_{10}R - 10\cdot\log_{10}\bigl(\sum_{i=1}^{N} e_i^2/N\bigr)$, where $e_i$ is the absolute error for point $i$, $N$ is the number of points, and $R$ is the value range of the dataset.
Figure 6: Visualization comparison (one slice) of absolute compression errors of unit SLE (left, CR = 91.4) and original linear merging (right, CR = 86.1) on Nyx's "baryon density" field (i.e., fine level, 18% density, unit block size = 16). Bluer means higher compression error.
3.2 Optimization of SZ_L/R Compression
The pre-processed AMR data, however, faces two significant challenges that prevent it from achieving optimal compression quality with the original SZ_L/R compressor. To overcome these challenges and further improve compression performance, we propose optimizing the SZ_L/R compressor. In the following paragraphs, we outline the two challenges and describe our proposed solutions to address them effectively.
Challenge 1: Low prediction accuracy of SZ_L/R on AMR data. As discussed in §3.1, the truncated unit blocks are linearized and sent to the SZ_L/R compressor. However, some merged small blocks may not be adjacent in the original dataset, resulting in poor data locality/smoothness between these non-neighboring blocks. This negatively affects the accuracy of SZ_L/R's predictor. An intuitive solution would be to compress each box individually. However, truncation can produce a large number of small data blocks (e.g., 5,000+), causing SZ_L/R to struggle on small datasets due to low encoding efficiency, as mentioned in §2.1. This is because the SZ compressor would use thousands of Huffman trees to encode these small blocks separately, leading to decreased encoding efficiency. In conclusion, the original SZ_L/R faces a dilemma: either predict and encode small blocks collectively (by merging them), which compromises prediction accuracy, or predict and encode each small block individually, incurring high Huffman encoding overhead.
Figure 7: Rate-distortion comparison between LM, SLE, adaptive SZ_L/R, and 1D compression, across different levels for Nyx's "baryon density" field: (a) fine level (unit block size = 16, density = 17.4%); (b) coarse level (unit block size = 8, density = 82.3%). The relative error bound ranges from 2×10⁻² to 3×10⁻⁴.
Solution 1: Improve prediction using unit SLE. To address Challenge 1, we propose using the Shared Lossless Encoding (SLE) technique in SZ_L/R. This method allows for separate prediction of unit data blocks while encoding them together with a single shared Huffman tree. Specifically, each unit block is initially predicted and quantized individually. Afterward, the quantization codes and regression coefficients from each unit block are combined to create a shared Huffman tree and then encoded. This approach improves the prediction performance of SZ_L/R without significantly increasing the time overhead during the encoding process.
As shown in Figure 6, unit SLE notably reduces the overall compression error in comparison to the original linear merging (LM), especially for data located at the boundaries of data blocks. As a result, this leads to a substantial improvement in rate distortion, as illustrated in Figure 7a. Note that the data used for testing in this section is the same as in §3.1. Specifically, the unit block size for the fine level is 16, while the unit block size for the coarse level is 8.
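Conceptually, unit SLE can be sketched as follows; quantize_block() and huffman_encode() are hypothetical stand-ins for SZ_L/R internals, and the real implementation also pools regression coefficients and applies further lossless compression.

```cpp
// Conceptual sketch of unit SLE (not SZ's actual code): predict and quantize
// each unit block independently, then pool all quantization codes so that a
// single shared Huffman tree encodes them together.
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for SZ_L/R internals (declarations only).
std::vector<int32_t> quantize_block(const std::vector<double>& block, double eb);
std::vector<uint8_t> huffman_encode(const std::vector<int32_t>& symbols);

std::vector<uint8_t> compress_with_sle(const std::vector<std::vector<double>>& unit_blocks,
                                       double eb) {
    std::vector<int32_t> all_codes;
    for (const auto& blk : unit_blocks) {
        // Prediction/quantization stays local to each block, so poor smoothness
        // between non-neighboring blocks does not hurt prediction accuracy.
        std::vector<int32_t> codes = quantize_block(blk, eb);
        all_codes.insert(all_codes.end(), codes.begin(), codes.end());
    }
    // One shared Huffman tree over all blocks avoids per-block encoding overhead.
    return huffman_encode(all_codes);
}
```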
Challenge 2: Unit SLE may produce undesirable residues. As previously mentioned, the input data is truncated into 6×6×6 blocks by the SZ_L/R compressor for separate processing. This block size was chosen to balance prediction accuracy against metadata overhead, achieving optimal overall compression quality.
When using unit SLE, the compressor will further partition each of these unit blocks. The issue is that the unit block size of data produced by AMReX is typically a power of two (i.e., 2ⁿ), which is not evenly divisible by 6. As a result, using a 6×6×6 cube to truncate unit blocks of certain sizes may leave undesirable residues that impact compression quality. For example, if the unit block is 8×8×8, SZ_L/R with unit SLE will further divide it into smaller blocks with sizes of 6×6×6 (one block), 6×6×2 (three "flat" blocks), 6×2×2 (three "slim" blocks), and 2×2×2 (one "tiny" block), as shown in Figure 8a. The data in the 6×6×2, 6×2×2, and 2×2×2 blocks is almost flattened/collapsed to 2D data, 1D data, and a single point, respectively, rather than preserving 3D data features. These "low-dimension" data blocks can greatly affect the prediction accuracy of SZ_L/R, as they cannot leverage high-dimensional topological information. As shown in Figure 7b, when the unit block size is 8, the unit SLE approach does not significantly improve the performance over the original LM method.
Figure 8: Example of the original partition (a) and adaptive partition (b) of SZ_L/R on a unit block with the size of 8×8×8; the gray boxes represent data that are difficult to compress.
Solution 2: SZ_L/R with adaptive block size. To address the issue of residue blocks that are difficult to compress, we propose an adaptive approach for selecting the block size used by the SZ_L/R compressor based on the unit block size of the AMR data. Equation (1) describes the adaptive block size selection method:
\[
\text{SZ\_BlkSize} =
\begin{cases}
4\times 4\times 4, & \text{if unitBlkSize} \bmod 6 \le 2;\\
6\times 6\times 6, & \text{if unitBlkSize} \bmod 6 > 2;\\
6\times 6\times 6, & \text{if unitBlkSize} \ge 64.
\end{cases}
\tag{1}
\]
Figure 9: Visualization comparison (one slice) of compression errors of the adaptive block size (left, CR = 39.8) and unit SLE (right, CR = 38.8) on Nyx's "baryon density" field (i.e., coarse level, 82% density, unit block size = 8). Bluer means higher compression error.
Figure 10: Visualization comparison (one slice) of original (uncompressed) data and decompressed data produced by original SZ_L/R and AMRIC's optimized SZ_L/R on Nyx's "baryon density" field (with 2 levels, 18% density for fine level): (a) original data; (b) original SZ_L/R, CR = 51.7; (c) AMRIC SZ_L/R, CR = 53.2. Warmer colors indicate higher values. The red dotted lines denote the boundaries between AMR levels, and the white arrows in Figure 10b highlight the artifacts between two AMR levels.
Specifically, if the remainder of the unit block size divided by the original SZ_L/R block size is less than or equal to 2, there will be undesirable residue blocks. In such cases, we adjust the SZ_L/R block size to 4×4×4 to avoid compressing these blocks with low compressibility and to improve prediction quality. Conversely, if the remainder is greater than 2, we use the original block size of 6×6×6. For example, as shown in Figure 8b, for the 8×8×8 unit block, we have 8 mod 6 = 2, and we select the SZ_L/R block size to be 4³. Although using an SZ_L/R block size of 4³ results in higher metadata overhead, the increased prediction accuracy compensates for it, achieving compression performance comparable to that of 6³ while avoiding the undesirable residue issue. Figure 9 illustrates that the adaptive block size approach (i.e., Adp-4) can significantly reduce compression errors compared to the SLE approach, leading to a considerable enhancement in rate-distortion, as shown in Figure 7b. On the other hand, when the unit block size is, for example, 16, the undesirable residue issue does not arise, and there is no need to use the adaptive block size approach, since it offers no obvious advantage over the SLE approach, as shown in Figure 7a. Furthermore, note that when the unit block size is relatively large (i.e., larger than 64, which is not common for AMR data), we retain the original SZ_L/R block size. This is because even if there are undesirable residue blocks, they only occupy a small portion of the dataset, while the compression performance using 6³ is slightly better than using 4³, offsetting the negative effect of the few residue blocks. Note that we did not select the SZ_L/R block size to be 8³ to eliminate the undesirable residue, because that would significantly reduce compression quality.
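For reference, Equation (1) translates directly into a small selection routine such as the hedged sketch below (returning the edge length of the cubic SZ_L/R block).

```cpp
// Direct transcription of Equation (1): pick the SZ_L/R block size from the
// AMR unit block size. Returns the edge length of the cubic SZ_L/R block.
int adaptive_sz_block_size(int unit_block_size) {
    if (unit_block_size >= 64) return 6;      // large unit blocks: keep 6x6x6
    if (unit_block_size % 6 <= 2) return 4;   // residues would collapse: use 4x4x4
    return 6;                                 // otherwise keep the default 6x6x6
}
```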
Improvement in Visualization Quality: It is worth noting that, compared to the original SZ_L/R, AMRIC's optimized SZ_L/R notably enhances the visualization quality of AMR data, particularly in areas with intense data fluctuations. For example, as shown in Figure 10, our optimized SZ_L/R effectively reduces the artifacts at the boundaries between different AMR levels (denoted by the red dotted lines), which were previously caused by the original SZ_L/R (as indicated by the white arrows in Figure 10b).
3.3 Modification of HDF5 Compression Filter
In this work, we use the HDF5 compression filter to enhance I/O performance and increase usability. However, as mentioned in §2.1, there are barriers between the HDF5 filter and the AMR application. Specifically, when employing the compression filter, HDF5 splits the in-memory data into multiple chunks, with filters being applied to each chunk individually. This makes it challenging to determine an appropriately large chunk size that improves the compression ratio and I/O performance. We face two primary obstacles when attempting to use a larger chunk size, and we discuss each of these challenges along with its solution below.
Challenge 1: AMR data layout issue for multiple fields. As discussed in §3.1, AMReX (patch-based AMR) divides each AMR level's domain into a collection of rectangular boxes/patches, with each box typically containing data from multiple fields. Consequently, in the AMReX framework, data corresponding to various fields within each box are stored contiguously, rather than being stored separately. For instance, as illustrated in the upper portion of Figure 11, we have three boxes and two fields (i.e., Temp for temperature and Vx for velocity in the x direction), and the data for Temp and Vx in different boxes are placed together. In this situation, when determining the HDF5 chunk size, it cannot exceed the size of the smallest box (i.e., Box-1).
This limitation arises because we want to avoid compressing different fields together with lossy compression, as that would force identical error bounds across various fields, despite their potentially significant differences in value ranges. Additionally, combining data from distinct fields can compromise data smoothness and negatively impact the compressor's performance.
As a result, the original AMReX could only utilize a small chunk size (i.e., 1024), which significantly increases the encoding overhead of compressing each small chunk separately and leads to a lower compression ratio. Moreover, the compressor has to be called for each small chunk, substantially raising the overall startup cost of the compressor and adversely affecting I/O performance.
Solution 1: Change data layout. A potential solution to this issue is to separate data from different fields into distinct buffers for compression. However, this approach requires compressing and writing multiple buffers into multiple HDF5 datasets simultaneously, resulting in reduced performance for HDF5 collective writes. Based on our observations, compressing and writing AMR data into multiple HDF5 datasets can be up to 5× slower than processing them collectively.
Figure 11: An example of an AMR dataset with two fields and three boxes, illustrating our data layout modification that applies a larger chunk size (indicated by the grey dashed line).
To address this problem, we propose continuing to compress and write data into a single HDF5 dataset, while modifying the data layout to group data from the same field of each box together, as depicted in the lower portion of Figure 11. This method allows us to increase the chunk size and compress the entire field as a whole. It is important to note that we achieve this by altering the loop access order when reading the data into the buffer, which adds minimal time overhead, rather than reorganizing the buffer itself. By increasing the chunk size, we can significantly enhance both the compression and I/O performance (as will be shown in §4).
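The essence of this layout change can be sketched as below, under an assumed in-memory representation where each box stores its fields contiguously: only the loop nesting order changes, so that each field becomes contiguous in the HDF5 write buffer.

```cpp
// Conceptual sketch of the layout change (assumed in-memory representation):
// instead of copying box 0 (all fields), box 1 (all fields), ..., we loop over
// fields first so that each field's data from every box is contiguous in the
// HDF5 buffer. Only the loop order changes; the boxes themselves are not moved.
#include <cstddef>
#include <vector>

// boxes[b][f] holds the values of field f in box b.
std::vector<double> pack_field_major(
        const std::vector<std::vector<std::vector<double>>>& boxes,
        std::size_t num_fields) {
    std::vector<double> hdf5_buffer;
    for (std::size_t f = 0; f < num_fields; ++f)        // field-major order
        for (std::size_t b = 0; b < boxes.size(); ++b)  // then boxes
            hdf5_buffer.insert(hdf5_buffer.end(),
                               boxes[b][f].begin(), boxes[b][f].end());
    // Now a chunk can span an entire field without mixing fields together.
    return hdf5_buffer;
}
```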
Challenge 2: Load imbalance for AMR data. Another challenge that prevents the adoption of a large chunk size is the load imbalance of AMR data across multiple processes. Given that the entire HDF5 dataset has to use the same chunk size, selecting an optimal global chunk size in a parallel scenario becomes difficult, as the data size on each MPI rank may vary.
For example, as shown in Figure 12, we have 4 ranks that hold different amounts of data in memory. Without loss of generality and for clearer demonstration, we suppose there is only one field. If we set the chunk size to the largest data size among all ranks (i.e., rank 1), there would be overhead on the other 3 ranks (i.e., we would have to pad useless data onto them). Clearly, this makes the compressor handle extra data and impacts the compression ratio as well as the I/O time.
Another intuitive solution is to let each rank write its data to its own dataset. In this way, each rank does not have to use the same global chunk size and can select its own chunk size based on its data size. The problem is that, due to the use of the filter, HDF5 has to perform collective writes, which means all processes need to participate in creating and writing each dataset. For example, when rank 0 is writing its data to dataset 0, the other 3 ranks also need to participate even if they have no data to be written to dataset 0. As a result, the other 3 ranks will be idle and wait for the current write (to dataset 0) to finish before proceeding with their own writes, causing a serial write with poor performance.
Solution 2: Modify the HDF5 filter mechanism. To address the above issue, we propose to still use a global chunk size equal to the largest data size across all ranks. However, we modify the compression filter and provide the actual data size of each rank (as shown in the magenta dashed box in Figure 12) to the compression filter before the compression process.
Figure 12: An example with four MPI ranks that hold different amounts of AMR data, demonstrating our proposed chunk size selection strategy. We select the chunk size to be the data size of rank 1 (outer blue box) while passing the actual data size to the compression filter (enclosing magenta dashed box).
It should be noted that we also need to store metadata, such as the original data size of each rank, for decompression purposes. The metadata overhead is minimal since the data component far outweighs the metadata to be written. This results in nearly no size overhead while improving the overall compression ratio and I/O time by adopting the largest possible chunk size.
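A hedged sketch of such a modified filter is given below. The mechanism for communicating the per-rank valid size (a process-local variable set before H5Dwrite) and the helpers sz_compress()/sz_decompress() are assumptions for illustration, since the paper does not prescribe the exact implementation; re-padding the decompressed buffer back to the full chunk size is omitted.

```cpp
// Conceptual sketch of a modified HDF5 filter callback (H5Z_func_t signature).
// The chunk may be padded up to the global chunk size; only the valid prefix is
// compressed, and the valid size is stored with the compressed stream so the
// read path knows how many bytes are meaningful. sz_compress()/sz_decompress()
// are hypothetical stand-ins; error handling and re-padding are omitted.
#include <hdf5.h>
#include <cstdlib>
#include <cstring>

static size_t g_actual_bytes = 0;  // set by the writing rank before H5Dwrite

size_t sz_compress(const void* in, size_t in_bytes, void** out);    // hypothetical
size_t sz_decompress(const void* in, size_t in_bytes, void** out);  // hypothetical

size_t amric_filter(unsigned flags, size_t /*cd_nelmts*/, const unsigned* /*cd_values*/,
                    size_t nbytes, size_t* buf_size, void** buf) {
    if (flags & H5Z_FLAG_REVERSE) {            // read path: decompress the valid part
        size_t valid = 0;
        std::memcpy(&valid, *buf, sizeof(valid));
        void* out = nullptr;
        size_t out_bytes = sz_decompress(static_cast<char*>(*buf) + sizeof(valid),
                                         nbytes - sizeof(valid), &out);
        std::free(*buf); *buf = out; *buf_size = out_bytes;
        return out_bytes;
    }
    // Write path: compress only the valid (unpadded) portion of the chunk.
    void* comp = nullptr;
    size_t comp_bytes = sz_compress(*buf, g_actual_bytes, &comp);
    void* packed = std::malloc(sizeof(g_actual_bytes) + comp_bytes);
    std::memcpy(packed, &g_actual_bytes, sizeof(g_actual_bytes));
    std::memcpy(static_cast<char*>(packed) + sizeof(g_actual_bytes), comp, comp_bytes);
    std::free(comp); std::free(*buf);
    *buf = packed; *buf_size = sizeof(g_actual_bytes) + comp_bytes;
    return *buf_size;
}
```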
4 EXPERIMENTAL EVALUATION
4.1 Experimental Setup
AMR applications. Our evaluation primarily focuses on two AMR applications developed with the AMReX framework [41]: the Nyx cosmology simulation [27] and the WarpX [12] electromagnetic and electrostatic Particle-In-Cell (PIC) simulation. Nyx, as shown in Figure 13, is a cutting-edge cosmology code that employs AMReX and combines compressible hydrodynamic equations on a grid with a particle representation of dark matter. Nyx generates six fields, including baryon density, dark matter density, temperature, and velocities in the x, y, and z directions. WarpX, as shown in Figure 14, is a highly parallel and highly optimized code that utilizes AMReX, runs on GPUs and multi-core CPUs, and features load-balancing capabilities. WarpX can scale up to the world's largest supercomputers and was the recipient of the 2022 ACM Gordon Bell Prize [29].
Test platform. Our test platform is the Summit supercomputer [28] at Oak Ridge National Laboratory, each node of which is equipped with two IBM POWER9 processors with 42 physical cores and 512 GB of DDR4 memory. It is connected to an IBM Spectrum Scale filesystem [38]. We use up to 128 nodes and 4096 CPU cores.
Comparison baseline. We compare our solution with AMReX's 1D SZ_L/R compression solution [3] (denoted by "AMReX"). We exclude zMesh and TAC because they are not in situ compression solutions, as mentioned in §1. Note that we evaluate AMRIC with both SZ_L/R (SZ with Lorenzo and linear regression predictors) and SZ_Interp (SZ with the spline interpolation predictor).
Test runs. As shown in Table 1, we have conducted six simulation runs in total, with three runs for each of the scientific applications, WarpX (WarpX_1, WarpX_2, and WarpX_3) and Nyx (Nyx_1, Nyx_2, and Nyx_3). Each simulation run consists of two levels, and the number of nodes (ranks) varies from 2 (64), to 16 (512), up to 128 (4096). In WarpX_1, the grid sizes of the levels progress from coarse to fine, with dimensions of 256×256×2048 and 512×512×4096, and data densities of 98.04% and 1.96%, respectively. The data size for one timestep in this run is 12.4 GB. For WarpX_2, the grid sizes are 512×512×4096 and 1024×1024×8192, accompanied by data densities of 98.05% and 1.96%. The data size for one timestep is 99.3 GB. For WarpX_3, the grid sizes are 1024×1024×8192 and 2048×2048×16384, accompanied by data densities of 98.96% and 1.04%. The data size for one timestep is 624 GB. Regarding Nyx_1, the grid sizes of the levels, from coarse to fine, are 256×256×256 and 512×512×512, with data densities of 98.6% and 1.4%, respectively. The data size for one timestep in this run is 1.6 GB. For Nyx_2, the grid sizes are 512×512×512 and 1024×1024×1024, and the data densities are 96.67% and 3.23%. The data size for one timestep is 12 GB. For Nyx_3, the grid sizes are 1024×1024×1024 and 2048×2048×2048, and the data densities are 98.3% and 1.7%. The data size for one timestep is 97.5 GB. Finally, as Summit uses a shared parallel filesystem whose performance fluctuates depending on the I/O load across the overall user base, we run each set of simulation runs multiple times and discard results with abnormal performance (extremely slow).
Table 1: Detailed information about our tested AMR runs.
Run     | #AMR Levels | #Nodes (#MPI ranks) | Grid size of each level (coarse to fine) | Density of each level (coarse to fine) | Data size (each timestep) | Error bound (AMRIC and AMReX)
WarpX_1 | 2 | 2 (64)     | 256×256×2048, 512×512×4096      | 98.04%, 1.96% | 12.4 GB | 1E-3, 5E-3
WarpX_2 | 2 | 16 (512)   | 512×512×4096, 1024×1024×8192    | 98.05%, 1.96% | 99.3 GB | 1E-3, 5E-3
WarpX_3 | 2 | 128 (4096) | 1024×1024×8192, 2048×2048×16384 | 98.96%, 1.04% | 624 GB  | 1E-4, 5E-4
Nyx_1   | 2 | 2 (64)     | 256×256×256, 512×512×512        | 98.6%, 1.4%   | 1.6 GB  | 1E-3, 1E-2
Nyx_2   | 2 | 16 (512)   | 512×512×512, 1024×1024×1024     | 96.67%, 3.23% | 12 GB   | 1E-3, 1E-2
Nyx_3   | 2 | 128 (4096) | 1024×1024×1024, 2048×2048×2048  | 98.3%, 1.7%   | 97.5 GB | 1E-3, 1E-2
Figure 13: Visualization of the baryon density field of Nyx.
Figure 14: Visualization of the electric field (x-direction) of WarpX.
4.2 Evaluation on Compression Ratio
As demonstrated in Table 2, our method, with both optimized SZ_L/R and optimized SZ_Interp, outperforms the original 1D baseline for all three runs of the two applications, with a particularly notable improvement in WarpX. This superior performance is primarily due to our 3D compression's ability to leverage spatial and topological information, thereby enhancing the compression process. Moreover, the optimizations for SZ_L/R and SZ_Interp outlined in §3 further improve their respective compression performances. Furthermore, as mentioned in §3.3, the small chunk size of the original AMReX compression leads to substantial encoding overhead, which ultimately results in a lower compression ratio.
Upon further analysis, we observe that WarpX exhibits a notably high compression ratio. This is mainly due to the smooth nature of the data generated by WarpX, as depicted in Figure 14, which results in excellent compressibility. In contrast, the data produced by Nyx appears irregular, as illustrated in Figure 13, making it more challenging to compress. Consequently, our AMRIC method has the potential to significantly reduce I/O time for WarpX by achieving greater data size reduction. Importantly, despite the difficulty of compressing Nyx data, AMRIC does not introduce significant overhead to the Nyx simulation, as will be demonstrated in §4.4.
In terms of specific performance, SZ_L/R proves more effective in handling Nyx data, while SZ_Interp delivers superior compression ratios for WarpX. This can be attributed to SZ_L/R's block-based predictor, which is better suited to capturing local patterns within Nyx data, whereas SZ_Interp's global interpolation predictor excels when applied to the overall smoother data produced by WarpX.
Table 2: Comparison of compression ratio (averaged across all fields/timesteps) between AMReX's original compression and AMRIC.
Run     | AMReX (1D) | AMRIC (SZ_L/R) | AMRIC (SZ_Interp)
WarpX_1 | 16.4  | 267.3 | 482.1
WarpX_2 | 117.5 | 461.2 | 2406.0
WarpX_3 | 29.6  | 949.0 | 4753.7
Nyx_1   | 8.8   | 15.0  | 14.0
Nyx_2   | 8.8   | 16.6  | 14.2
Nyx_3   | 8.7   | 16.3  | 13.6
4.3 Evaluation on Reconstruction Data Quality
As shown in Table 3, the reconstruction data quality of AMRIC (including both SZ_L/R and SZ_Interp) is considerably higher than that of AMReX, due to our optimizations as well as the benefit of 3D compression. As shown in Figure 15, the error generated by AMRIC is considerably lower than that of AMReX. Note that the block-like pattern of the error stems from the parallel compression performed by 512 processes, each assigned an error bound relative to the data range in its corresponding block. Therefore, the absolute error of each block is independent of those of the other blocks.
Comparison with oline solution TAC [
39
]: To further demon-
strate the eectiveness of AMRIC’s optimized SZ_L/R, we conduct
Table 3: Comparison of reconstruction data quality (in PSNR) with
AMReX’s original compression and AMRIC for dierent runs.
Run AMReX(1D) AMRIC(SZ_L/R) AMRIC(SZ_Interp)
Nyx_1 52.5 66.8 66.5
Nyx_2 56.7 69.1 68.9
Nyx_3 54.9 68.3 68.0
WarpX_1 73.6 80.3 79.9
WarpX_2 78.5 83.8 88.7
WarpX_3 82.5 97.9 103.1
SC ’23, Nov 12–17, 2023, Denver, CO Wang et al.
0 128 256 384 512
0
128
256
384
512
AMRIC-SZ-L/R
0 128 256 384 512
0
128
256
384
512
AMReX
0.0e+005.0e+08 1.0e+09 1.5e+09 2.0e+092.5e+093.0e+09 3.5e+09
Figure 15: Visualization comparison (one slice) of compression er-
rors of our AMRIC (left) and AMReX’s compression (right) on Nyx_2
“baryon density” eld (i.e., coarse level, 96% density, unit block
size=32). Bluer/darker means higher compression error.
50 100 150 200 250 300
Compression Ratio
80
85
90
95
100
PSNR (dB)
TAC
AMRIC
Figure 16: Rate-distortion comparison of TAC and AMRIC using
TAC’s dataset [16] (i.e., Run1_Z10).
a comparison with a state-of-the-art oine compression approach
for 3D AMR data, called TAC (will be introduced in §5), using the
dataset from TAC’s work. As depicted in Figure 16, AMRIC outper-
forms TAC in terms of compression quality, achieving up to a 2.2
×
higher compression ratio while maintaining the same PSNR. This
superior performance can be attributed to the fact that, unlike TAC,
which only focuses on pre-processing and uses SZ_L/R as a black
box, AMRIC optimizes both the pre-processing and SZ_L/R.
Insight into different compressors' performance on AMR simulations. One takeaway is that, compared with SZ_Interp, SZ_L/R is more suitable for compressing AMR data because both AMR data and SZ_L/R are block-based, while SZ_Interp is global. Specifically, AMR simulations divide the data into boxes, which can negatively impact data locality and smoothness. SZ_L/R, on the other hand, also truncates data into blocks, aligning well with AMR simulations. By applying our optimization approach outlined in §3.2, SZ_L/R with SLE and adaptive block size can effectively mitigate the impact on data locality/smoothness caused by AMR applications, resulting in ideal compatibility with AMR applications. In contrast, since SZ_Interp applies global interpolation to the unstructured, block-based AMR data, it is challenging to achieve perfect compatibility between SZ_Interp and AMR applications.
4.4 Evaluation on I/O Time
The overall writing time consists of either (1) pre-processing (including copying data to the HDF5 buffer, handling metadata, and calculating the offset for each process) plus the I/O time without compression (directly writing the data to disk), or (2) pre-processing plus the I/O time with compression (including the compression computation cost and writing the compressed data to the file system).
Figure 17: Writing time of WarpX runs at different scales (in a weak scaling study), broken down into pre-processing and I/O time for NoComp, AMReX, AMRIC (SZ-L/R), and AMRIC (SZ-Interp). A log scale is used for better comparison.
Figure 18: Writing time of Nyx runs at different scales, broken down into pre-processing and I/O time for the same four configurations. A log scale is used for better comparison.
Figure 17 shows that our in situ compression significantly reduces the overall writing time for HDF5 when compared to writing data without compression. This reduction reaches up to 90% for the largest-scale WarpX run3 and 64% for the larger-scale WarpX run2, without introducing any noticeable overhead for the smaller-scale WarpX run1. This is because, for the relatively large-scale WarpX run3 and run2, the total data to be written amounts to 624 GB and 99 GB, respectively. Due to the high compression ratio, we can significantly save both writing time and overall processing time.
Our approach also considerably reduces the total writing time, by 97% compared to the original AMReX compression for WarpX run1, by 93% for WarpX run2, and by 89% for WarpX run3. We can see that AMReX's compression is extremely slow on WarpX. This is because, first, as discussed in §3.3 (Challenge 1), the original compression of AMReX is constrained by the existing data layout, which necessitates the use of a small HDF5 chunk size. This constraint negatively impacts compression time and ratio, causing suboptimal I/O performance. Moreover, this negative impact becomes more severe when there are relatively larger amounts of data in each process. Specifically, for all three WarpX runs, each process contains at least 128³ data points (considering only the coarse level). Consequently, given that the HDF5 chunk size for the original AMReX compression is 1024, each process will call the compressor 2048 times, generating a substantial startup cost and leading to extremely slow I/O. For WarpX run3, an HDF5 chunk size of 1024 causes issues, so we instead use 4096 as the chunk size. This issue further emphasizes the significance of our modifications presented in §3.3, which effectively utilize the HDF5 compression filter. It is worth noting that the impact of this small-chunk issue is relatively mitigated when there are fewer data points in each process (as demonstrated in the subsequent evaluation of Nyx).
Also, note that our proposed method does not introduce any significant overhead to the pre-processing time. This is primarily because our pre-processing strategy is lightweight, employing AMReX's built-in functions to identify redundant coarse data. Furthermore, the truncation and block ordering in the pre-processing stage, as well as the data layout changes in §3.3, are performed while loading data into the compression buffer, eliminating the need for extra rearrangement operations. In addition, the elimination of redundant data by the pre-processing workflow reduces the amount of data to be processed, lowering the pre-processing time.
To further demonstrate the performance of our method on data with low compressibility, as well as when each process owns less data, we conduct our test using Nyx at a smaller scale but with the same number of nodes (ranks). In our Nyx runs, each process owns 64×64×64 data points in the coarse level, which is 8 times fewer than in the WarpX runs (i.e., 128×128×128). Note that a smaller data size in each process lowers the overall lossless encoding efficiency of the data, thus affecting the compression ratio.
As shown in Figure 18, even in this challenging setup (i.e., low data compressibility and low encoding efficiency), AMRIC is still able to achieve writing speeds comparable to those with no compression for all three Nyx runs. Furthermore, AMRIC significantly reduces the total writing time, by 79% compared to the original AMReX compression for Nyx run1, by 53% for Nyx run2, and by 64% for Nyx run3. It is worth noting that the small-chunk issue in the Nyx runs is relatively mitigated compared to WarpX. Specifically, we observe that the writing time using AMReX's compression for both the 2-node and 16-node runs on Nyx is reduced by approximately 55 seconds, while the reduction for the 128-node run is approximately 10 seconds. Our interpretation is that the time taken to launch the compressor once remains constant (e.g., 0.03 seconds). In the Nyx runs, each process needs to call the compressor only 256 times, as opposed to the 2048 calls required in WarpX run1 and run2 (due to the increased HDF5 chunk size mentioned above, WarpX run3 only requires 512 calls). This difference results in a time reduction of (2048 − 128) × 0.03 ≈ 55 seconds for the writing process for WarpX run1 and run2, and (512 − 128) × 0.03 ≈ 10 seconds for WarpX run3.
5 RELATED WORK
There are two recent works focusing on designing efficient lossy
compression methods for AMR datasets. Specifically, zMesh, proposed
by Luo et al. [26], aims to leverage the data redundancy across
different AMR levels. It reorders the AMR data across different
refinement levels in a 1D array to improve the smoothness of the
data. To achieve this, zMesh arranges adjacent data points together
in the 1D array based on their physical coordinates in the original
2D/3D dataset. However, by compressing the data in a 1D array,
zMesh cannot leverage higher-dimension compression, leading to a
loss of topology information and data locality in higher-dimension
data. On the other hand, TAC, proposed by Wang et al. [39], was
designed to improve zMesh's compression quality through adaptive
3D compression. Specifically, TAC pre-processes AMR data before
compression by adaptively partitioning and padding based on the
data characteristics of different AMR levels.
While zMesh and TAC provide offline compression solutions
for AMR data, they are not designed for in situ compression of
AMR data. In particular, zMesh requires extra communication to
perform reordering in parallel scenarios. Specifically, zMesh must
arrange neighboring coarse and fine data more closely. However,
the neighboring fine and coarse data might not be owned by the
same MPI rank. As a result, data from different levels must be
transferred to the appropriate processes, leading to high communication
overhead. TAC requires the reconstruction of the entire physical
domain's hierarchy to execute its pre-processing approach. This
relatively complex process results in significant overhead for in situ
data compression.
Although the AMReX framework [41] supports in situ AMR
data compression through HDF5 compression filters [3], it has
two main drawbacks: (1) The original AMReX compression only
compresses the data in 1D, limiting its capacity to benefit from
higher-dimension compression and resulting in sub-optimal compression
performance, particularly in terms of compression quality.
(2) The original AMReX compression cannot effectively utilize the
HDF5 filter. Specifically, due to the limitation of the data layout
for multiple fields, AMReX can only adopt a very small chunk size
to prevent compressing different physical fields together. As a result,
AMReX needs to apply the compressor separately for each
small chunk, resulting in low I/O and compression performance, as
demonstrated in §4.2, §4.3, and §4.4.
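To make the chunk-size constraint concrete, the sketch below shows the general pattern of writing a single field through an HDF5 compression filter via the HDF5 C API: the filter runs once per chunk, so a 1D dataset written with a small chunk size triggers many separate compressor launches. This is a simplified illustration rather than AMReX's or AMRIC's actual I/O path; the filter ID (32017, commonly registered for H5Z-SZ) and the omission of filter parameters are assumptions made for brevity.

```cpp
// Minimal sketch: one field, one chunked 1-D dataset, one HDF5 compression
// filter. The filter (and thus the compressor) runs once per chunk, i.e.,
// field.size() / chunk times, so enlarging the chunk reduces launch overhead.
#include <hdf5.h>
#include <vector>

void write_compressed(const char* path, const std::vector<float>& field, hsize_t chunk) {
    const hsize_t n = field.size();
    hid_t file  = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, &n, nullptr);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);            // compression granularity = one chunk
    const H5Z_filter_t kFilter = 32017;       // assumed registered ID (e.g., H5Z-SZ)
    H5Pset_filter(dcpl, kFilter, H5Z_FLAG_MANDATORY, 0, nullptr);  // real filters
                                              // usually also need cd_values; omitted

    hid_t dset = H5Dcreate2(file, "field", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, field.data());

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
}
```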
6 CONCLUSION AND FUTURE WORK
In conclusion, we have presented AMRIC, an effective in situ lossy
compression framework for AMR simulations that significantly enhances
I/O performance and compression quality. Our primary contributions
include designing a compression-oriented in situ pre-processing
workflow for AMR data, optimizing the state-of-the-art
SZ lossy compressor for AMR data, efficiently utilizing the HDF5
compression filter on AMR data, and integrating AMRIC into the
AMReX framework. We evaluated AMRIC on two real-world AMReX
applications, WarpX and Nyx, using 4096 CPU cores from
the Summit supercomputer. The experimental results demonstrate
that AMRIC achieves up to 10.5× I/O performance improvement
over the non-compression solution, as well as up to 39× I/O performance
improvement and up to 81× compression ratio improvement, with
better data quality, over the original AMReX's compression solution.
In future work, we plan to evaluate AMRIC on additional AM-
ReX applications accelerated by GPUs. Furthermore, we will assess
AMRIC on a wider range of HPC systems and at different scales.
Additionally, we will incorporate our in situ compression solution
into other AMR frameworks.
ACKNOWLEDGEMENT
This work (LA-UR-23-24096) has been authored by employees of Triad Na-
tional Security, LLC which operates Los Alamos National Laboratory under
Contract No. 89233218CNA000001 with the U.S. Department of Energy and
National Nuclear Security Administration. The material was supported by
the U.S. Department of Energy, Office of Science and Office of Advanced Scientific
Computing Research (ASCR), under contract DE-AC02-06CH11357.
This work was partly supported by the Exasky Exascale Computing Project
(17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office
of Science and the National Nuclear Security Administration. This work was
also supported by NSF Grants OAC-2003709, OAC-2303064, OAC-2104023,
OAC-2247080, OAC-2311875, OAC-2311876, OAC-2312673.
REFERENCES
[1]
2023. HDF5 Filters. https://docs.hdfgroup.org/hdf5/develop/_f_i_l_t_e_r.html
Online.
[2]
Mark Ainsworth, Ozan Tugluk, Ben Whitney, and Scott Klasky. 2018. Multilevel
techniques for compression and reduction of scientific data—the univariate case.
Computing and Visualization in Science 19, 5–6 (2018), 65–76.
[3]
AMReX - HDF5 Plotfile Compression. 2023. https://amrex-codes.github.io/amrex/docs_html/IO.html#hdf5-plotfile-compression. Online.
[4]
AMReX’s documentation. 2023. https://amrex-codes.github.io/amrex/docs_html/
Basics.html#boxarray. Online.
[5]
Allison H Baker, Dorit M Hammerling, and Terece L Turton. 2019. Evaluating
image quality measures to assess the impact of lossy data compression applied
to climate simulation data. In Computer Graphics Forum, Vol. 38. Wiley Online
Library, 517–528.
[6]
Kevin J Bowers, BJ Albright, L Yin, B Bergen, and TJT Kwan. 2008. Ultrahigh
performance three-dimensional electromagnetic relativistic kinetic plasma simu-
lation. Physics of Plasmas 15, 5 (2008), 055703.
[7]
Franck Cappello, Sheng Di, Sihuan Li, Xin Liang, Ali Murat Gok, Dingwen Tao,
Chun Hong Yoon, Xin-Chuan Wu, Yuri Alexeev, and Frederic T Chong. 2019.
Use cases of lossy compression for floating-point data in scientific data sets. The
International Journal of High Performance Computing Applications (2019).
[8] cuZFP. 2023. https://github.com/LLNL/zfp/tree/develop/src/cuda_zfp. Online.
[9] Sheng Di. 2023. H5Z-SZ. https://github.com/disheng222/H5Z-SZ Online.
[10]
Sheng Di and Franck Cappello. 2016. Fast error-bounded lossy HPC data com-
pression with SZ. In 2016 IEEE International Parallel and Distributed Processing
Symposium. IEEE, 730–739.
[11]
Sheng Di and Franck Cappello. 2016. Fast error-bounded lossy HPC data com-
pression with SZ. In 2016 IEEE International Parallel and Distributed Processing
Symposium. IEEE, Chicago, IL, USA, 730–739.
[12]
L. Fedeli, A. Huebl, F. Boillod-Cerneux, T. Clark, K. Gott, C. Hillairet, S. Jaure, A.
Leblanc, R. Lehe, A. Myers, C. Piechurski, M. Sato, N. Zaim, W. Zhang, J. Vay, and
H. Vincenti. 2022. Pushing the Frontier in the Design of Laser-Based Electron
Accelerators with Groundbreaking Mesh-Refined Particle-In-Cell Simulations on
Exascale-Class Supercomputers. In SC22: International Conference for High Per-
formance Computing, Networking, Storage and Analysis. IEEE Computer Society,
Los Alamitos, CA, USA, 1–12. https://doi.org/10.1109/SC41404.2022.00008
[13]
Mike Folk, Gerd Heber, Quincey Koziol, Elena Pourmal, and Dana Robinson. 2011.
An overview of the HDF5 technology suite and its applications. In Proceedings of
the EDBT/ICDT 2011 Workshop on Array Databases. 36–47.
[14]
Pascal Grosset, Christopher Biwer, Jesus Pulido, Arvind Mohan, Ayan Biswas,
John Patchett, Terece Turton, David Rogers, Daniel Livescu, and James Ahrens.
2020. Foresight: analysis that matters for data reduction. In 2020 SC20: Inter-
national Conference for High Performance Computing, Networking, Storage and
Analysis (SC). IEEE Computer Society, 1171–1185.
[15]
Guénolé Harel, Jacques-Bernard Lekien, and Philippe P Pébaÿ. 2017. Two new con-
tributions to the visualization of AMR grids: I. interactive rendering of extreme-
scale 2-dimensional grids ii. novel selection filters in arbitrary dimension. arXiv
preprint arXiv:1703.00212 (2017).
[16] hipdac tac. 2023. https://github.com/hipdac-lab/HPDC22-TAC. Online.
[17]
Sian Jin, Pascal Grosset, Christopher M Biwer, Jesus Pulido, Jiannan Tian, Ding-
wen Tao, and James Ahrens. 2020. Understanding GPU-Based Lossy Compression
for Extreme-Scale Cosmological Simulations. arXiv preprint arXiv:2004.00224
(2020).
[18]
Sian Jin, Jesus Pulido, Pascal Grosset, Jiannan Tian, Dingwen Tao, and James
Ahrens. 2021. Adaptive Configuration of In Situ Lossy Compression for Cos-
mology Simulations via Fine-Grained Rate-Quality Modeling. arXiv preprint
arXiv:2104.00178 (2021).
[19]
Sian Jin, Dingwen Tao, Houjun Tang, Sheng Di, Suren Byna, Zarija Lukic, and
Franck Cappello. 2022. Accelerating parallel write via deeply integrating predic-
tive lossy compression with HDF5. arXiv preprint arXiv:2206.14761 (2022).
[20]
Xin Liang, Sheng Di, Dingwen Tao, Sihuan Li, Shaomeng Li, Hanqi Guo, Zizhong
Chen, and Franck Cappello. 2018. Error-controlled lossy compression optimized
for high compression ratios of scientic datasets. In 2018 IEEE International
Conference on Big Data. IEEE, 438–447.
[21]
Xin Liang, Kai Zhao, Sheng Di, Sihuan Li, Robert Underwood, Ali M. Gok, Jiannan
Tian, Junjing Deng, Jon C. Calhoun, Dingwen Tao, Zizhong Chen, and Franck
Cappello. 2022. SZ3: A Modular Framework for Composing Prediction-Based
Error-Bounded Lossy Compressors. IEEE Transactions on Big Data (2022), 1–14.
https://doi.org/10.1109/TBDATA.2022.3201176
[22]
Peter Lindstrom. 2014. Fixed-rate compressed floating-point arrays. IEEE Trans-
actions on Visualization and Computer Graphics 20, 12 (2014), 2674–2683.
[23] Peter Lindstrom. 2023. H5Z-ZFP. https://github.com/LLNL/H5Z-ZFP Online.
[24]
Tao Lu, Qing Liu, Xubin He, Huizhang Luo, Eric Suchyta, Jong Choi, Norbert
Podhorszki, Scott Klasky, Mathew Wolf, Tong Liu, et al. 2018. Understanding
and modeling lossy compression schemes on HPC scientific data. In 2018 IEEE
International Parallel and Distributed Processing Symposium. IEEE, 348–357.
[25]
Huizhang Luo, Dan Huang, Qing Liu, Zhenbo Qiao, Hong Jiang, Jing Bi, Haitao
Yuan, Mengchu Zhou, Jinzhen Wang, and Zhenlu Qin. 2019. Identifying Latent
Reduced Models to Precondition Lossy Compression. In 2019 IEEE International
Parallel and Distributed Processing Symposium. IEEE.
[26]
Huizhang Luo, Junqi Wang, Qing Liu, Jieyang Chen, Scott Klasky, and Norbert
Podhorszki. 2021. zMesh: Exploring Application Characteristics to Improve Lossy
Compression Ratio for Adaptive Mesh Refinement. In 2021 IEEE International
Parallel and Distributed Processing Symposium (IPDPS). IEEE, 402–411.
[27] NYX simulation. 2019. https://amrex-astro.github.io/Nyx/. Online.
[28]
Oak Ridge Leadership Computing Facility. [n.d.]. Summit Supercomputer. https://www.olcf.ornl.gov/summit/
[29]
Oak Ridge Leadership Computing Facility. 2023. WarpX, granted early access
to the exascale supercomputer Frontier, receives the high-performance computing
world’s highest honor. https://www.olcf.ornl.gov/2022/11/17/plasma-simulation-code-wins-2022-acm-gordon-bell-prize/ Online.
[30]
Russ Rew and Glenn Davis. 1990. NetCDF: an interface for scientic data access.
IEEE computer graphics and applications 10, 4 (1990), 76–82.
[31]
James M Stone, Kengo Tomida, Christopher J White, and Kyle G Felker. 2020.
The Athena++ adaptive mesh refinement framework: Design and magnetohy-
drodynamic solvers. The Astrophysical Journal Supplement Series 249, 1 (2020),
4.
[32]
Dingwen Tao, Sheng Di, Zizhong Chen, and Franck Cappello. 2017. Significantly
improving lossy compression for scientific data sets based on multidimensional
prediction and error-controlled quantization. In 2017 IEEE International Parallel
and Distributed Processing Symposium. IEEE, 1129–1139.
[33]
Dingwen Tao, Sheng Di, Zizhong Chen, and Franck Cappello. 2017. Significantly
improving lossy compression for scientific data sets based on multidimensional
prediction and error-controlled quantization. In 2017 IEEE International Parallel
and Distributed Processing Symposium. IEEE, 1129–1139.
[34]
Dingwen Tao, Sheng Di, Xin Liang, Zizhong Chen, and Franck Cappello. 2019.
Optimizing lossy compression rate-distortion from automatic online selection
between SZ and ZFP. IEEE Transactions on Parallel and Distributed Systems 30, 8
(2019), 1857–1871.
[35]
The HDF Group. 2023. Hierarchical data format version 5. http://www.hdfgroup.org/HDF5 Online.
[36]
Jiannan Tian, Sheng Di, Xiaodong Yu, Cody Rivera, Kai Zhao, Sian Jin, Yunhe
Feng, Xin Liang, Dingwen Tao, and Franck Cappello. 2021. Optimizing error-
bounded lossy compression for scientic data on GPUs. In 2021 IEEE International
Conference on Cluster Computing (CLUSTER). IEEE, 283–293.
[37]
Jiannan Tian, Sheng Di, Kai Zhao, Cody Rivera, Megan Hickman Fulp, Robert
Underwood, Sian Jin, Xin Liang, Jon Calhoun, Dingwen Tao, and Franck Cap-
pello. 2020. cuSZ: An Efficient GPU-Based Error-Bounded Lossy Compression
Framework for Scientific Data. (2020), 3–15.
[38]
Marc-André Vef. 2016. Analyzing file create performance in IBM Spectrum Scale.
Master’s thesis, Johannes Gutenberg University Mainz (2016).
[39]
Daoce Wang, Jesus Pulido, Pascal Grosset, Sian Jin, Jiannan Tian, James Ahrens,
and Dingwen Tao. 2022. TAC: Optimizing Error-Bounded Lossy Compression
for Three-Dimensional Adaptive Mesh Renement Simulations. In Proceedings of
the 31st International Symposium on High-Performance Parallel and Distributed
Computing. 135–147.
[40]
Feng Wang, Nathan Marshak, Will Usher, Carsten Burstedde, Aaron Knoll, Timo
Heister, and Chris R. Johnson. 2020. CPU Ray Tracing of Tree-Based Adaptive
Mesh Refinement Data. Computer Graphics Forum 39, 3 (2020), 1–12.
[41]
Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy
Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, et al. 2019. AMReX:
a framework for block-structured adaptive mesh refinement. Journal of Open
Source Software 4, 37 (2019), 1370–1370.
[42]
Kai Zhao, Sheng Di, Maxim Dmitriev, Thierry-Laurent D Tonellot, Zizhong Chen,
and Franck Cappello. 2021. Optimizing error-bounded lossy compression for
scientific data by dynamic spline interpolation. In 2021 IEEE 37th International
Conference on Data Engineering (ICDE). IEEE, 1643–1654.