Methodology for Performance Evaluation of the Input/Output System on Computer Clusters.
ABSTRACT The increase of processing units, speed and computational power, and the complexity of scientific applications that use high performance computing require more efficient Input/Output (I/O) systems. In order to efficiently use the I/O it is necessary to know its performance capacity to determine if it fulfills applications I/O requirements. This paper proposes a methodology to evaluate I/O performance on computer clusters under different I/O configurations. This evaluation is useful to study how different I/O subsystem configurations will affect the application performance. This approach encompasses the characterization of the I/O system at three different levels: application, I/O system and I/O devices. We select different system configuration and/or I/O operation parameters and we evaluate the impact on performance by considering both the application and the I/O architecture. During I/O configuration analysis we identify configurable factors that have an impact on the performance of the I/O system. In addition, we extract information in order to select the most suitable configuration for the application.
Sandra Méndez, Dolores Rexachs and Emilio Luque
Computer Architecture and Operating System Department (CAOS)
Universitat Autònoma de Barcelona
Keywords-Parallel I/O System, I/O Architecture, Mass Storage,
I. INTRODUCTION
The increase in processing units, the advance in speed
and computational power, and the increasing complexity of
scientific applications that use high performance computing
require more efficient Input/Output (I/O) systems. The performance
of many scientific applications is inherently limited by
the I/O system. Because of the historical "gap" between
computing and I/O performance, the I/O system often becomes
the bottleneck of parallel systems. In order to hide this "gap",
the I/O factors with the biggest effect on performance must be
identified. This situation leads us to ask the following questions:
Should scientific applications adapt to I/O configurations?
How do I/O factors influence performance? How should the I/O
system be configured to adapt to the scientific application?
Answering these questions is not trivial. The designer or
administrator faces the difficulty of selecting the components of
the I/O subsystem (JBOD (Just a Bunch Of Disks), RAID level,
filesystem, and interconnection network, among other factors) and
of choosing among different connection models (DAS, SAN, NAS, NASD),
each with different parameters to configure (redundancy level,
bandwidth, and efficiency, among others). Programmers can modify
their programs to manage I/O operations efficiently, but they
need to know the I/O system, at least succinctly. To use the I/O
system efficiently it is necessary to know its performance
capacity and to determine whether it fulfills the I/O requirements
of the applications. There are several research works on I/O system
performance evaluation, but these studies were made for specific
parallel computer configurations. I/O in a computer cluster
occurs on a hierarchical I/O path, and the application carries out
its I/O operations along this path.
We propose an I/O system evaluation that takes into account
both the application requirements and the I/O configuration
by focusing on the I/O path. The proposed methodology to
evaluate I/O system performance has three phases: character-
ization, analysis of I/O configuration, and evaluation. In the
characterization phase, we extract the application I/O require-
ments, bandwidth and I/O operations per second (IOPs). With
this information we determine the amount of data and type
(file level or block level) that needs to be stored and shared.
We evaluate the bandwidth, IOPs, and latency at filesystem
level, network, I/O library and I/O devices. In the second
phase, the I/O configuration analysis, we identify configurable
factors that impact the I/O system performance. We analyze
the filesystem, I/O node connection, placement and state of
buffer/cache, data redundancy and service redundancy. We use
these factors along with application behavior to compare and
analyze the I/O configurations in the cluster. Finally, in the
third phase, the evaluation, we collect metrics of the application
execution under different configurations. We determine the
inefficiency by analyzing the difference between measured
values and characterized values.
This article focuses on the characterization and evaluation
phases. The I/O configuration analysis is applied to only
two cases because the evaluation of different I/O architecture
configurations requires a variety of I/O resources or a simulation
tool. For this reason, we are studying the SIMCAN simulation
tool to implement the different I/O architectures.
The rest of this article is organized as follows: Section II
reviews the related work, Section III introduces our proposed
methodology, and Section IV reviews the experimental
validation of this proposal. Finally, in the last section, we
present our conclusions and future work.
II. RELATED WORK
We need to understand the application behavior and the I/O
configuration factors that have an impact on application
performance. To do this, we study the state of the art in
I/O architecture and in the I/O characterization of scientific
applications.
Although the following articles focus on supercomputer I/O
systems, we observed that the I/O configuration factors that
affect performance apply to small and medium computer clusters
as well. Various papers present different I/O configurations for
parallel computers and show how these configurations are used to
improve the performance of the I/O subsystem.
The I/O performance analysis developed at Sandia National
Laboratories on the Red Storm platform is presented in
. The Red Storm I/O configuration had I/O nodes and Data
Direct Network (DDN) couplets, with four I/O nodes connected
to each DDN couplet. To arrive at a theoretical estimate for
the Red Storm configuration, the authors started from the
definition of a single end-to-end path across which an
I/O operation travels.
In  a highly scalable parallel file I/O architecture for
Blue Gene/L is presented. It leverages the hierarchical and
functional partitioning design of the system software by taking
into account computational and I/O cores. The architecture exploits
the scalability of GPFS (General Parallel File System)
at the backend, and MPI-IO is used as the interface between
the application I/O and the filesystem. The impact of this high
performance I/O solution for Blue Gene/L is demonstrated
with a comprehensive evaluation of a number of widely used
parallel I/O benchmarks and I/O intensive applications.
In  the authors presented an in-depth evaluation of
parallel I/O software stack of the Cray XT4 platform at
Oak Ridge National Laboratory (ORNL). The Cray XT4 I/O
subsystem was provided through 3 servers Lustre filesystems.
The evaluation covers the performance of a variety of parallel
I/O interfaces, including POSIX IO, MPI-IO, and HDF5.
Furthermore, a user-level perspective is presented in  to
empirically reveal the implications of storage organization of
parallel programs running on Jaguar at the ORNL. The authors
described the hierarchical configuration of the Jaguar Super-
computer Storage System. They evaluated the performance
of individual storage components. In addition, they examined
the scalability of metadata, and data-intensive benchmarks on
Jaguar, and they showed that the file distribution pattern can
have an impact on the aggregated I/O bandwidth.
In  a case study of the I/O challenges to performance and
scalability on the IBM Blue Gene/P system at the Argonne
Leadership Computing Facility was presented. The authors
evaluated both the software and the hardware of the I/O system
and presented a study of PVFS and GPFS at filesystem level. They
evaluated the I/O system with the NAS BT-IO, MadBench2, and
Flash3 I/O benchmarks.
These works focus on the filesystem, the I/O architecture
and different I/O libraries, and their solutions are designed for
the owner's own I/O system.
We observed that the performance achieved on the I/O
system is affected seriously by the I/O architecture and the
application characteristics. An understanding of the application
I/O behavior is necessary to efficiently use the I/O system in
all I/O path levels. The following papers are focused on the
I/O characterization of applications.
Carns  presented the Darshan tracing tool for the
characterization of petascale I/O workloads. Darshan is
designed to capture an accurate picture of the application I/O
behavior, including properties such as patterns of access within
files, with minimum overhead. Also, in , Carns presented
a multilevel application I/O study and a methodology for sys-
temwide, continuous, scalable I/O characterization that com-
bines storage device instrumentation, static filesystem analysis,
and a new mechanism for capturing detailed application-level
The authors of  presented an approach for characterizing
the I/O demands of applications on the Cray XT. They
also showed case studies of the use of their I/O infrastructure
characterization with climate studies and combustion simula-
Byna  used I/O signatures for parallel I/O prefetching.
This work is useful to identify patterns and I/O characteristics
to determine the application behavior.
Nakka  presented a tool to extract I/O traces from very
large applications that run at full scale during production.
They analyze these traces to obtain information of the appli-
The previous papers showed the performance evaluation
of different I/O configurations of parallel systems. The
performance of the I/O system is evaluated at different I/O
levels: I/O library, filesystem, storage network and devices.
This, added to the diversity of I/O architectures on computer
clusters, makes the performance evaluation of I/O systems
difficult. Therefore, we propose a methodology for
performance evaluation on different I/O configurations. We
propose to characterize the I/O requirements and behavior of
the application, and to characterize the I/O system at the
I/O library, filesystem, storage network and I/O device levels.
In this way, we intend to cover the path that data follows through
the I/O system of a computer cluster. We also focus on the specific
components of the I/O architecture that we consider to have the
biggest impact on the performance of the I/O system.
III. PROPOSED METHODOLOGY
In order to evaluate the I/O system performance it is necessary
to know its storage capacity and throughput. The storage capacity
depends on the number, type and capacity of the devices.
The throughput depends on the IOPs (Input/Output operations per
second) and the latency. Moreover, this capacity is different at
each I/O system level. The performance also depends on the
connection of the I/O node, the management of I/O devices, the
placement of the I/O node in the network topology, the buffer/cache
state and placement, and the availability of data and service.
Furthermore, to determine whether an application uses the whole I/O
system capacity, it is necessary to know its I/O behavior
and requirements. It is therefore necessary to characterize the
behavior of both the I/O system and the application to evaluate the
I/O system performance. We propose a methodology composed
of three phases: Characterization, I/O Configuration Analysis,
and Evaluation. The methodology is shown in Fig. 1. It
is used to evaluate the percentage of the I/O system performance
that is used and to identify the possible points of inefficiency.
Also, when the cluster has different I/O configurations, it is used
to analyze which configuration is the most appropriate for an
application.
Fig. 1. Methodology for Performance Evaluation on I/O System
A. Characterization
The characterization phase is divided into two parts: Application
and System (I/O system and I/O devices). It is applied
to obtain the capacity and performance of the I/O system.
Here, we explain the system characterization and the scientific
application characterization.
1) I/O System and Devices: The parallel system is characterized
at three levels: I/O library, I/O node (filesystem) and devices
(local filesystem). We characterize the bandwidth (bw), latency
(l) and IOPs for each level, as shown in Fig. 2.
Fig. 3 shows "what" information we obtain for the I/O system
and devices and "how" we obtain it. We also obtain characterized
configurations with their performance tables at each I/O path
level. In TABLE I we present the data structure of the I/O system
performance table for the filesystem and the I/O library. To evaluate
the global filesystem and the local filesystem, the IOzone and/or
bonnie++ benchmarks can be used. A parallel filesystem
can be evaluated with the IOR benchmark. At library level,
the b_eff_io or IOR benchmarks can be used.
To explain this phase we present the characterization of
the I/O system of the cluster Aohyper. This cluster has the
following characteristics: 8 nodes with AMD Athlon(tm) 64 X2
Dual Core Processor 3800+, 2GB of RAM and a 150GB local
disk. The local filesystem is Linux ext4 and the global filesystem is
NFS. The NFS server has a RAID 1 (2 disks) with 230GB
capacity and a RAID 5 (5 disks) with stripe=256KB and 917GB
capacity, both with write-cache enabled (write back); two
Gigabit Ethernet networks, one for communication and the
other for data.
Fig. 2. I/O system Characterization
Fig. 3. Characterization phase for the I/O system and Devices
At I/O device level, the cluster Aohyper has three I/O
configurations (Fig. 4). The JBOD configuration is a single disk
without redundancy. The RAID 1 configuration has a disk with
its mirror disk, and the RAID 5 configuration has five disks.
The parallel system and storage device characterization
was done with IOzone. Fig. 5 shows the results for the network
filesystem and the local filesystem for the three configurations.
The experiments were performed at block level with a file
size that doubles the main memory size, and the block size
was varied from 32KB to 16MB.
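As a minimal sketch of the sweep described above, the following Python function derives the file size and the list of block sizes from the node memory. The function name and the assumption that the block size doubles at each step are illustrative; the text only states the file size rule and the 32KB to 16MB range.

```python
def characterization_sweep(ram_bytes, min_block=32 * 1024, max_block=16 * 1024**2):
    """Return (file_size, block_sizes) for a filesystem characterization run.

    Hypothetical helper: the doubling step between block sizes is an assumption.
    """
    file_size = 2 * ram_bytes  # file twice the main memory size, as in the text
    block_sizes = []
    block = min_block
    while block <= max_block:
        block_sizes.append(block)
        block *= 2  # assumed progression: 32KB, 64KB, ..., 16MB
    return file_size, block_sizes


# Aohyper nodes have 2GB of RAM, so the test file would be 4GB.
file_size, block_sizes = characterization_sweep(2 * 1024**3)
```

Under the doubling assumption, this gives ten block sizes per configuration, each of which would be measured for every access mode and operation type.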
TABLE I
DATA STRUCTURE OF I/O PERFORMANCE TABLE
Local (0), Global (1)
Sequential, Strided, Random
read (0), write (1)
Fig. 4. I/O configurations of the cluster Aohyper
Fig. 5. Local filesystem and Network filesystem Characterization: (b) RAID 1, (c) RAID 5
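One possible in-memory representation of a performance table entry is sketched below in Python. The field names, the units and the class names are assumptions made for illustration; only the attribute values (Local/Global, Sequential/Strided/Random, read/write) and the use of block size and transfer rate come from the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class AccessType(IntEnum):
    LOCAL = 0
    GLOBAL = 1


class Operation(IntEnum):
    READ = 0
    WRITE = 1


ACCESS_MODES = ("sequential", "strided", "random")


@dataclass(frozen=True)
class PerfEntry:
    """One characterized point of an I/O path level (hypothetical layout)."""
    operation: Operation
    block_size: int        # bytes (unit assumed)
    access_type: AccessType
    access_mode: str       # one of ACCESS_MODES
    transfer_rate: float   # MB/s (unit assumed)

    def __post_init__(self):
        # Reject access modes that the performance table does not define.
        if self.access_mode not in ACCESS_MODES:
            raise ValueError(f"unknown access mode: {self.access_mode}")


entry = PerfEntry(Operation.WRITE, 32 * 1024, AccessType.GLOBAL, "sequential", 45.0)
```

A performance table for one I/O path level is then simply a list of such entries, one per characterized combination.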
The IOR benchmark was used to analyze the I/O library. It
was configured with a file size of 32GB on the RAID configurations
and 12GB on JBOD, block sizes from 1MB to 1024MB,
and a transfer block size of 256KB. It was launched with 8
processes. Fig. 6 shows the characterization on the three
configurations.
Fig. 6. I/O Library Characterization: (b) RAID 1, (c) RAID 5
2) Scientific Application: We have characterized the application
to evaluate the I/O system utilization and to know the
I/O requirements. The application performance is measured by
the I/O time, the transfer rate and the IOPs. We extract the type,
number and size of the I/O operations at library level.
Fig. 7 shows "what" and "how" we monitor, and the monitored
information of the application. This information is used in the
evaluation phase to determine whether the application performance is
limited by the application characteristics or by the I/O system.
To evaluate the application characterization at process level,
an extension of the PAS2P tracing tool was developed.
PAS2P identifies and extracts the phases of the application and,
by similarity analysis, selects the significant phases (by
analyzing computation and communication) and their weights. The
representative phases are used to create a Parallel Application
Signature and to predict the application performance. PAS2P
instruments and executes applications on a parallel machine,
and produces a trace log. The data collected is used to
characterize the computational and communication behavior. We
incorporated the I/O primitives into the PAS2P tracing tool to
capture the relationship between the computations and the I/O
operations. We trace all the I/O primitives of the MPI-2 standard.
Thus, we created a library
which is loaded when the application is executed with
Fig. 7. Characterization phase for the Application
The tracing tool was extended to capture the information
necessary to define (in the future) a functional model of the
application. We propose to identify the significant phases with
an access pattern and their weights. With the characterization,
we try to find the application I/O phases. Due to the fact that
scientific applications show a repetitive behavior, m phases
will exist in the application. Such behavior can be observed
in Figs. 8 and 16 (graphics generated with Jumpshot and MPE
tracing tool), where both NAS BT-IO and MadBench2 show
To explain the methodology, the characterization is applied
to the Block Tridiagonal (BT) application of the NAS Parallel
Benchmark suite (NPB). NAS BT-IO, whose MPI implementation
employs a fairly complex domain decomposition called
diagonal multi-partitioning, is a good case to test the speed of
parallel I/O. Each processor is responsible for multiple Carte-
sian subsets of the entire data set, whose number increases
as the square root of the number of processors participating
in the computation. Every five time steps the entire solution
field, consisting of five double-precision words per mesh point,
must be written to one or more files. After all time steps are
finished, all data belonging to a single time step must be stored
in the same file, and must be sorted by vector component,
x-coordinate, y-coordinate, and z-coordinate. We used two
implementations of BT-IO:
• simple: MPI I/O without collective buffering. This means
that no data rearrangement takes place, so that many seek
operations are required to write the data to file.
• full: MPI I/O with collective buffering. The data scattered
in memory among the processors is collected on a subset
of the participating processors and rearranged before it is
written to a file, in order to increase granularity.
The characterization of class C of NAS BT-IO in the full
and simple subtypes is shown in TABLE II. Fig. 8
shows the global behavior of NAS BT-IO. The full
subtype has 40 writing phases and 1 reading phase. A writing
operation is done after 120 messages have been sent, with their
respective Wait and Wait All. The reading phase consists of
40 reading operations done after all writing procedures are
finished. This is done by each MPI process. The simple
subtype has the same phases, but each writing phase carries out
6,561 writing operations and the reading phase performs 262,440
reading operations.
TABLE II
NAS BT-IO CHARACTERIZATION - CLASS C - 16 PROCESSES
NAS BT-IO full
2,073,600 and 2,125,440
2,073,600 and 2,125,440
1.56KB and 1.6KB
1.56KB and 1.6KB
Fig. 8. NAS BT-IO traces for 16 processes where reading is green colour,
writing is purple and yellow is communication. (a) Full subtype (b) Simple subtype
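The figures of the simple subtype can be cross-checked with a short calculation. The Python sketch below assumes the NPB class C problem size (a 162 x 162 x 162 mesh and 200 time steps), which is not stated in this article, and interprets each write of the simple subtype as one row of mesh points along a single axis inside a cell of the multi-partition decomposition. Both points are assumptions; the words per mesh point and the five-step write interval come from the text.

```python
import math

GRID_POINTS = 162      # assumption: class C mesh points per dimension
TIME_STEPS = 200       # assumption: class C number of time steps
WRITE_INTERVAL = 5     # from the text: the solution is written every five time steps
WORDS_PER_POINT = 5    # from the text: five double-precision words per mesh point
BYTES_PER_WORD = 8     # double precision


def write_phases():
    """Number of writing phases over the whole run."""
    return TIME_STEPS // WRITE_INTERVAL


def simple_block_sizes(processes):
    """Sizes in bytes of one row of mesh points inside a cell."""
    cells_per_dim = math.isqrt(processes)
    short_row = GRID_POINTS // cells_per_dim
    rows = {short_row}
    if GRID_POINTS % cells_per_dim:
        rows.add(short_row + 1)  # some cells are one point longer
    return sorted(r * WORDS_PER_POINT * BYTES_PER_WORD for r in rows)


def simple_writes_per_phase(processes):
    """Average number of rows written per process in one writing phase."""
    cells_per_dim = math.isqrt(processes)
    total_rows = cells_per_dim * GRID_POINTS ** 2
    return total_rows / processes


phases = write_phases()                         # 40
sizes_16 = simple_block_sizes(16)               # [1600, 1640]
per_phase_16 = simple_writes_per_phase(16)      # 6561.0
total_16 = int(phases * per_phase_16 * 16)      # 4199040
```

Under these assumptions the sketch reproduces the 40 writing phases, the 6,561 writes per phase and the 262,440 operations per process (40 x 6,561) reported for 16 processes, and 16 x 262,440 = 4,199,040 operations in total.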
B. Input/Output Configuration Analysis
In this second phase of the methodology, we identify the
configurable I/O factors and select the I/O configurations. This
selection depends on user requirements, as shown in Fig. 1. We explain
the configurable factors and the I/O configuration selection.
1) Configurable Factors: In the I/O architecture we considered
the following configurable factors: number and type of
filesystems (local, distributed and parallel), number and type
of networks (dedicated or shared with computing),
state and placement of buffer/cache, number of I/O devices,
organization of I/O devices (RAID level, JBOD), and number and
placement of I/O nodes.
For our example the cluster Aohyper has ext4 as local
filesystem and NFS as global filesystem. NFS server is an
I/O node for shared accesses and there are eight I/O nodes
for local accesses where the data sharing must be done by the
user. There are two networks, one for services and the other
for data transfer.
The cluster Aohyper has two levels of RAID: 1 and 5, and
a JBOD. There is no redundancy of service (duplicated I/O
2) I/O Configuration selection: The configuration is se-
lected based on the performance provided in the I/O path and
the RAID level.
We tested the I/O system for different software RAID levels
within local disks, and for both network configurations: a
shared network, or two split networks, one for communication
and the other for data transfer.
For this article we have selected three configurations: JBOD,
RAID 1 and RAID 5.
C. Evaluation
In the evaluation phase, the application is run on each selected
I/O configuration. The application values are compared with the
characterized values of each configuration to determine the
utilization and the possible points of inefficiency in the I/O path.
Fig. 9 shows the evaluation step. In this phase, we prepare the
evaluation environment, define the I/O metrics for the performance
evaluation, and analyze the application on the different
configurations.
Fig. 9. Evaluation Phase
For the evaluation phase, we set parameters for the application,
the library and the architecture. For our example, we
evaluate NAS BT-IO class C with the MPICH library.
1) Selection of I/O metrics: The metrics for the evaluation
are: execution time, I/O time (time to do reading and writing
operations), I/O operations per second (IOPs), latency of I/O
operations, and throughput (number of megabytes transferred
per second).
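The metrics listed above can be derived from four raw measurements, as in the following Python sketch. The function and key names are hypothetical, 1 MB is taken as 2^20 bytes by assumption, latency is computed as the mean time per operation, and the I/O fraction of the run time is added only as a convenience.

```python
def io_metrics(bytes_transferred, operations, io_time_s, exec_time_s):
    """Derive evaluation metrics from raw measurements (illustrative sketch)."""
    megabytes = bytes_transferred / 2**20  # assumption: 1 MB = 2**20 bytes
    return {
        "execution_time_s": exec_time_s,
        "io_time_s": io_time_s,
        "iops": operations / io_time_s,          # I/O operations per second
        "latency_s": io_time_s / operations,     # mean time per operation
        "throughput_MBps": megabytes / io_time_s,
        "io_fraction": io_time_s / exec_time_s,  # share of run time spent in I/O
    }


# Hypothetical run: 100 MB moved in 50 operations, 2 s of I/O in a 4 s execution.
metrics = io_metrics(100 * 2**20, 50, 2.0, 4.0)
```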
2) Analysis of relationship between I/O factors and I/O
metrics: We compare the measurements of the application execution
in each configuration with the characterized values of the I/O path
levels (Fig. 2). Each configuration has a performance table for each
level of the I/O path. Fig. 10 shows the flowchart for generating
the used percentage table; its logic is explained in the
following steps:
• Read the operation type, block size, access type,
access mode and transfer rate (bw) of the application execution.
• Search the performance file (TablePerf) for the
characterized transfer rate at the different I/O path levels,
based on the operation type, block size, access
mode, and access type of the application.
• Calculate the percentage used by the application with respect to
the characterized transfer rate at each I/O path level.
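The last step above reduces to a ratio between the measured and the characterized transfer rates. A minimal Python sketch follows; the level names and the numeric values are illustrative and are not taken from the measurements in this article.

```python
def used_percentage(measured_bw, characterized_bw):
    """Percentage of the characterized transfer rate reached by the application."""
    return 100.0 * measured_bw / characterized_bw


def used_percentage_table(measured_bw, characterized_by_level):
    """Build one row of the used percentage table, one value per I/O path level."""
    return {
        level: round(used_percentage(measured_bw, bw), 2)
        for level, bw in characterized_by_level.items()
    }


# Hypothetical values in MB/s for a single operation type and block size.
row = used_percentage_table(
    30.0,
    {"I/O library": 60.0, "network filesystem": 50.0, "local filesystem": 100.0},
)
```

With these illustrative values the application uses 50% of the I/O library, 60% of the network filesystem and 30% of the local filesystem characterized rates.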
The algorithm to search for the transfer rate at each I/O level is
shown in Fig. 11; it is applied in each search stage of Fig.
10. Fig. 11 is explained with the following steps:
• Open the performance table and set the variable
"found", which stops the search when the values are found.
• If the operation type, access mode, and access type are equal
to those of an entry in the performance table, and the block size of
the operation is:
– less than the minimum block size of the performance
table, then select the transfer rate corresponding
to the minimum block size.
– greater than the maximum block size of the performance
table, then select the transfer rate corresponding
to the maximum block size.
– equal to a block size of the performance table, then
select the transfer rate corresponding to that block
size.
– a value between the characterized values, then
select the transfer rate of the closest upper characterized value.
• When the search finishes, close the performance table
and return the transfer rate.
Fig. 10. Generation Algorithm for the table of used percentage
Fig. 11. Searching algorithm on performance table
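The search steps above can be sketched in Python as follows. The table is assumed to be a list of dictionaries with hypothetical keys; a binary search over the sorted block sizes replaces the sequential scan with the "found" flag, which gives the same result for the four cases.

```python
import bisect


def find_transfer_rate(table, operation, access_mode, access_type, block_size):
    """Return the characterized transfer rate for an operation, or None.

    `table` is a list of dicts with the keys used below (assumed layout).
    """
    rows = sorted(
        (r["block_size"], r["transfer_rate"])
        for r in table
        if r["operation"] == operation
        and r["access_mode"] == access_mode
        and r["access_type"] == access_type
    )
    if not rows:
        return None  # no entry characterizes this kind of operation
    sizes = [size for size, _ in rows]
    # Index of the first characterized block size >= block_size. This covers
    # the "less than minimum", "equal" and "closest upper value" cases.
    i = bisect.bisect_left(sizes, block_size)
    # Above the maximum characterized block size: use the maximum.
    i = min(i, len(rows) - 1)
    return rows[i][1]


# Hypothetical performance table for sequential local reads.
perf = [
    {"operation": "read", "access_mode": "sequential", "access_type": "local",
     "block_size": 32 * 1024, "transfer_rate": 10.0},
    {"operation": "read", "access_mode": "sequential", "access_type": "local",
     "block_size": 64 * 1024, "transfer_rate": 20.0},
    {"operation": "read", "access_mode": "sequential", "access_type": "local",
     "block_size": 128 * 1024, "transfer_rate": 30.0},
]
rate = find_transfer_rate(perf, "read", "sequential", "local", 100 * 1024)
```

In this example a 100KB operation falls between the 64KB and 128KB entries, so the rate of the closest upper entry (128KB) is selected.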
The characterized values were measured with the I/O system
under stress. When the application is not limited by I/O at a
specific level, the used percentage will probably exceed 100%.
We then evaluate the next level in the I/O path to analyze the
use of the I/O system.
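One way to read this rule is as a walk down the I/O path that stops at the first level whose used percentage does not exceed 100%. The Python sketch below follows that reading; the level names, their order and the sample values are assumptions made for illustration.

```python
IO_PATH = ("I/O library", "network filesystem", "local filesystem")


def first_level_to_analyze(used_by_level, path=IO_PATH):
    """Return the first level of the I/O path with a used percentage <= 100.

    Levels above 100% are treated as not limiting the application, so the
    walk continues with the next level. Returns None if every level is
    exceeded.
    """
    for level in path:
        if used_by_level[level] <= 100.0:
            return level
    return None


# Hypothetical percentages for an application that writes large blocks.
level = first_level_to_analyze(
    {"I/O library": 140.0, "network filesystem": 115.0, "local filesystem": 52.0}
)
```

Here the library and network filesystem levels are exceeded, so the analysis would concentrate on the local filesystem.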
Fig. 12. NAS BT-IO Class C 16 Processes
TABLE III
PERCENTAGE (%) OF I/O SYSTEM USE FOR NAS BT-IO ON WRITING
RAID5 24.60 29.52
Following our example, we analyze NAS BT-IO on the
cluster Aohyper. Fig. 12 shows the execution time, I/O time
and throughput for NAS BT-IO class C using 16 processes
executed on the three different configurations.
The evaluation covers the full subtype (with collective I/O) and
the simple subtype (without collectives). The percentage of use
of the I/O system is shown in TABLE III for writing operations
and TABLE IV for reading operations.
The full subtype is a more efficient implementation than
the simple subtype for NAS BT-IO, and we observe that the
capacity of the I/O system is exploited for class C. For the
simple subtype, however, only about 30% of the I/O system
performance is used on reading operations and less than 15% on
writing operations. The NAS BT-IO simple subtype performs 4,199,040
writes and 4,199,040 reads with block sizes of 1,600 and 1,640
bytes (TABLE II). This carries a high penalty in I/O time,
which impacts the execution time (Fig. 12).
For this application, I/O is not the bounding factor in the full
subtype because the capacity of the I/O system is sufficient
for the I/O requirements. The simple subtype does not manage
to exploit the I/O system to its capacity because of its access
pattern. On the other hand, when we look for the most
suitable configuration for the application, the full subtype
has similar performance on the three configurations, so the
selection depends on the level of availability that the user is
willing to pay for. Furthermore, the proposed methodology
allowed us to characterize the behavior of NAS BT-IO and to quantify
the differences in the use of the I/O system.
In order to test the methodology, an evaluation of NAS
BT-IO for 16 and 64 processes was carried out on a different
cluster, called cluster A. Furthermore,
we evaluated the MadBench2 benchmark on the clusters A and
Aohyper.
Cluster A is composed of 32 compute nodes, each with 2 x Dual-
Core Intel(R) Xeon(R) 3.00GHz, 12 GB of RAM, a 160
GB SATA disk and Dual Gigabit Ethernet, and a front-end node
acting as NFS server, with Dual-Core Intel(R) Xeon(R) 2.66GHz,
8 GB of RAM, a RAID 5 of 1.8 TB and Dual Gigabit Ethernet.
TABLE IV
PERCENTAGE (%) OF I/O SYSTEM USE FOR NAS BT-IO ON READING
JBOD 309.74 127.93
RAID1 310.00 128.04
RAID5 303.11 125.20
Fig. 13. Local filesystem and network filesystem results for the cluster A
A. System and Devices Characterization
The characterization of the I/O system of cluster A is presented in
Fig. 13. We evaluated the local and network filesystems with
IOzone. The I/O library was evaluated with the IOR benchmark
using a 40 GB file size, block sizes from 1 MB to 1024 MB, and
a 256 KB transfer block (Fig. 14).
B. NAS BT-IO Characterization
The characterization for 16 processes is shown in
TABLE II. Because we characterize the behavior of the application,
it is not necessary to re-characterize the application on another
system for the same class and number of processes. The
characterization for 64 processes is shown in TABLE V.
C. I/O configuration analysis
Cluster A has an I/O node that provides service to shared
files by NFS, with RAID 5 storage. Furthermore,
there are thirty-two I/O-compute nodes for local and
independent accesses. Because cluster A does not offer
different I/O configurations, we used the
methodology to evaluate the percentage of the I/O system used
by NAS BT-IO and MadBench2.
Fig. 14. I/O library results on the cluster A
TABLE V
NAS BT-IO CHARACTERIZATION - CLASS C - 64 PROCESSES
800 bytes and 840 bytes
800 bytes and 840 bytes
Fig. 15. NAS BT-IO Class C - 16 and 64 processes
D. NAS BT-IO Evaluation
NAS BT-IO is executed with 16 and 64 processes to evaluate
the use of capacity and performance. Fig. 15 shows the execution
time, I/O time and throughput for the NAS BT-IO full and simple
subtypes. TABLE VI shows the percentage of use of the I/O
library, NFS and local filesystem for NAS BT-IO on writing
operations.
In TABLE VII we present the percentage of use of the I/O
library, NFS and local filesystem for NAS BT-IO on reading
operations.
The full subtype is an efficient implementation that achieves
more than 100% of the characterized performance of the
I/O library for both 16 and 64 processes. However,
with a greater number of processes, the I/O system affects the
run time of the application. The NAS BT-IO full subtype is limited
in cluster A by computing and/or communication. It does not
achieve 50% of the NFS characterized values, and the I/O time
increases with a greater number of processes because of the
communication among processes and the I/O operations. The
NAS BT-IO simple subtype makes low use of
the I/O system. Furthermore, it is limited by I/O for this
I/O configuration of cluster A: the I/O time is greater than
90% of the run time. For this system, the I/O network and
communication bound the application performance.
TABLE VI
PERCENTAGE (%) OF I/O SYSTEM USE FOR NAS BT-IO ON WRITING
Number of Processes
TABLE VII
PERCENTAGE (%) OF I/O SYSTEM USE FOR NAS BT-IO ON READING
Number of Processes
TABLE VIII
MADBENCH2 CHARACTERIZATION - 16 AND 64 PROCESSES
16 x file
16 x file
16 x file
16 x file
Fig. 16. MadBench2 traces for 16 processes where sky-blue colour represents
reading operations and green colour represents writing operations.
(a) SHARED filetype (b) UNIQUE filetype
E. MadBench2 Characterization
MADbench2 is a tool for testing the overall integrated
performance of the I/O, communication and calculation sub-
systems of massively parallel architectures under the stresses
of a real scientific application. MADbench2 is based on the
MADspec code, which calculates the maximum likelihood an-
gular power spectrum of the Cosmic Microwave Background
radiation from a noisy pixelized map of the sky and its pixel-
pixel noise correlation matrix. MADbench2 can be run as
single or multi-gang; in the former all the matrix operations
are carried out distributed over all processors, whereas in the
latter the matrices are built, summed and inverted over all
the processors (S & D), but then redistributed over subsets
of processors (gangs) for their subsequent manipulations (W
& C). MADbench2 can be run in IO mode, in which all
calculations/communications are replaced with busy-work, and
the D function is skipped entirely. The function S writes, W
reads and writes, and C reads. This is denoted as S_w, W_w, W_r,
C_r. MADbench2 reports the mean, minimum and maximum
times spent in calculation/communication, busy-work, reading
and writing in each function. Running MADbench2 requires
a square number of processors.
Fig. 16 shows MadBench2 traces for 16 processes with
UNIQUE and SHARED filetypes. TABLE VIII shows the
characterization for 16 and 64 processes with UNIQUE and
SHARED filetypes.
Fig. 17. MadBench2 results on the cluster Aohyper: (a) UNIQUE filetype, time and transfer rate; (b) SHARED filetype, time and transfer rate.
MadBench2 has three I/O phases: a writing phase for the
function S with 8 writing operations, a writing-reading phase
for function W with 8 writing operations and 8 reading
operations, and a reading phase for function C with 8 reading
operations. This is done for each process of the MPI world.
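The three I/O phases can be written down as a small table of per-process operation counts. The following Python sketch is only an illustration under the counts stated in the text (8 operations per phase and direction); the names `PHASES`, `is_square` and `total_operations` are hypothetical and are not part of MADbench2.

```python
import math

# I/O operations issued by each process in each MADbench2 function,
# as described in the text: S writes, W writes and reads, C reads.
PHASES = {
    "S": {"write": 8, "read": 0},
    "W": {"write": 8, "read": 8},
    "C": {"write": 0, "read": 8},
}


def is_square(n: int) -> bool:
    """Return True if n is a positive perfect square (MADbench2 needs a square process count)."""
    root = math.isqrt(n)
    return n > 0 and root * root == n


def total_operations(num_processes: int) -> dict:
    """Total write and read operations over all processes of the MPI world."""
    if not is_square(num_processes):
        raise ValueError(f"{num_processes} is not a square number of processes")
    per_process = {
        op: sum(phase[op] for phase in PHASES.values())
        for op in ("write", "read")
    }
    return {op: count * num_processes for op, count in per_process.items()}


print(total_operations(16))  # {'write': 256, 'read': 256}
print(total_operations(64))  # {'write': 1024, 'read': 1024}
```

Each process issues 16 writes and 16 reads in total, so the operation count grows linearly with the number of processes.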
F. MadBench2 Evaluation on the cluster Aohyper
We evaluate MadBench2 for the previous three configurations
on the Cluster Aohyper with 16 processes. MadBench2
parameters are set as IOMETHOD = MPI, IOMODE = SYNC,
18 KPIX and 8 BIN. Fig. 17(a) shows results for
FILETYPE=UNIQUE and Fig. 17(b) shows results for FILETYPE=SHARED.
MadBench2 exceeds the characterized I/O library and network
filesystem performance for both UNIQUE and SHARED filetypes
on reading and writing operations. This is because MadBench2
reads and writes large block sizes. For this reason we
only present the table of percentage of use on the local filesystem.
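As an illustration of how a percentage of use can be obtained, the sketch below assumes it is the ratio between the transfer rate measured for the application and the transfer rate characterized for an I/O level, multiplied by 100. That definition, the function name `percentage_of_use` and all the transfer rates are assumptions made for this example; they are not measurements from the clusters.

```python
def percentage_of_use(measured_mb_s: float, characterized_mb_s: float) -> float:
    """Percentage of the characterized transfer rate reached by the application."""
    if characterized_mb_s <= 0:
        raise ValueError("characterized transfer rate must be positive")
    return 100.0 * measured_mb_s / characterized_mb_s


# Hypothetical characterized transfer rates (MB/s) per I/O level.
characterized = {
    "I/O library": 80.0,
    "network filesystem": 90.0,
    "local filesystem": 200.0,
}
measured = 100.0  # hypothetical application transfer rate (MB/s)

for level, rate in characterized.items():
    use = percentage_of_use(measured, rate)
    note = "exceeds characterized rate" if use > 100.0 else "within characterized rate"
    print(f"{level}: {use:.0f}% ({note})")
```

Under this reading, a value above 100% means the application exceeds what was characterized at that level, which is the case in which only the lower levels remain informative.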
In TABLE IX, the percentage of use on the local filesystem by
MadBench2 is shown. At the local filesystem level, MadBench2
on JBOD exploits the I/O capacity with the SHARED filetype. With
the UNIQUE filetype, MadBench2 uses 10% less than with the
SHARED filetype. MadBench2 on RAID 1 uses around 50% of
the I/O performance for both reading and writing
with the SHARED and UNIQUE filetypes. On RAID 5, MadBench2
uses 30% of the I/O performance for both reading and writing in
For MadBench2, the most suitable configuration is RAID
5 because this I/O configuration provides a higher transfer rate
for reading and writing operations. The choice of configuration
also affects how much I/O time weighs on run time.
PERCENTAGE (%) OF USE FOR MADBENCH2 ON LOCAL FILESYSTEM
Fig. 18. MadBench2 results on the cluster A: (a) UNIQUE filetype, time and transfer rate; (b) SHARED filetype, time and transfer rate.
G. MadBench2 Evaluation on the Cluster A
We evaluate MadBench2 on the Cluster A for 16 and 64 processes
with IOMETHOD = MPI, IOMODE = SYNC, 18 KPIX
and 8 BIN. Fig. 18(a) shows results for FILETYPE=UNIQUE
and Fig. 18(b) shows results for FILETYPE=SHARED.
TABLE X shows the percentage of use at the network filesystem
level, and TABLE XI shows the percentage of use on the local
filesystem by MadBench2.
USED PERCENTAGE (%) BY MADBENCH2 ON NETWORK FILESYSTEM
USED PERCENTAGE (%) BY MADBENCH2 ON LOCAL FILESYSTEM
(table columns: FILETYPE, S_w, W_w, W_r, C_r)
At the I/O library level, MadBench2 exceeds the characterized I/O
performance for 16 and 64 processes for both filetypes. For this
reason, we only present the tables of percentage of use for the
network and local filesystems. We observe the highest values
for reading with the UNIQUE filetype and 64 processes. This is
because the file size of each process is less than the RAM
size, so the reading operations are served from the buffer/cache
and not physically from the disk.
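The cache effect can be checked with a rough estimate of the data each process handles. The sketch below assumes that every one of the 8 BIN matrices is an NPIX x NPIX array of 8-byte values split evenly among the processes, that 18 KPIX means 18000 pixels, and that 1 GiB of RAM is available; the sizing formula, the RAM figure and the function names are assumptions for illustration, not values reported for cluster A.

```python
def file_size_per_process(npix: int, nbin: int, num_processes: int,
                          bytes_per_element: int = 8) -> int:
    """Estimated bytes written by one process (assumed even split of each matrix)."""
    matrix_bytes = npix * npix * bytes_per_element
    return nbin * matrix_bytes // num_processes


def fits_in_ram(file_bytes: int, ram_bytes: int) -> bool:
    """True when the per-process data is smaller than the available RAM."""
    return file_bytes < ram_bytes


RAM_BYTES = 1024 ** 3  # hypothetical 1 GiB of RAM, not a measured value

for procs in (16, 64):
    size = file_size_per_process(npix=18000, nbin=8, num_processes=procs)
    cached = fits_in_ram(size, RAM_BYTES)
    print(f"{procs} processes: {size / 1e6:.0f} MB per process, fits in RAM: {cached}")
# 16 processes: 1296 MB per process, fits in RAM: False
# 64 processes: 324 MB per process, fits in RAM: True
```

With these assumed figures the data per process shrinks by a factor of four from 16 to 64 processes, which is the kind of change that lets reads be served from the buffer/cache.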
MadBench2 performs differently in each I/O phase.
At the network filesystem level, the I/O system is used almost
to capacity with 64 processes for both UNIQUE and SHARED
filetypes. For 16 processes, the functions S and W use around
15% more I/O performance for writing with the SHARED filetype
than with the UNIQUE filetype. For reading the situation is
reversed: the percentage of use with the UNIQUE filetype is
about 30% higher than with the SHARED filetype.
The same behavior is observed at the local filesystem level,
although there the difference in percentage of use between both
filetypes is about 3% for writing and 10% for reading operations.
MadBench2 has the best performance on the cluster A with 64
processes and the UNIQUE filetype. Nevertheless, the SHARED
filetype also obtains acceptable performance, with a percentage
of use above 85% for the I/O system at the network filesystem
level.
A methodology to analyze I/O performance of parallel
computers has been proposed and applied. Such methodology
encompasses the characterization of the I/O system at different
levels: device, I/O system and application. We analyzed and
evaluated the configuration of different elements that impact
performance by considering the application and the I/O ar-
chitecture. This methodology was applied in two different
clusters for the NAS BT-IO benchmark and MadBench2.
The characteristics of both I/O systems were evaluated, as
well as their impact on the performance of the application.
We also observe that the same application has different
transfer rates for writing and reading operations on the same
I/O configuration. This affects the selection of the
proper configuration, because it is necessary to analyze the
operation that carries more weight for the application. MadBench2 is
an example that shows the impact of the same configuration on
different phases of the application, as observed in the
experimentation.
As future work, we aim to define an I/O model of the
application to support the evaluation, design and selection of
the configurations. This model is based on the application
characteristics and I/O system, and it is being developed to
determine which I/O configuration meets the performance
requirements of the user on a given system. We will extract
the functional behavior of the application and we will define
the I/O performance for the application given the functionality
of the application at I/O level. In order to test other configurations,
we are analyzing the simulation framework SIMCAN
and plan to use this tool to model I/O architectures.
This research has been supported by MICINN-Spain under