The MOSIX Parallel I/O System for Scalable I/O Performance
Lior Amar, Amnon Barak
Institute of Computer Science
The Hebrew University of Jerusalem
Jerusalem, 91904 Israel
and Amnon Shiloh
This paper presents the MOSIX Scalable Parallel In-
put/Output (MOPI) system that uses the process migration
that are scattered among different nodes. MOSIX is a Unix
based, cluster operating system that already supports pre-
emptive process migration for load-balancing and memory
ushering. MOPI supports splitting files to several nodes. It
can deliver a high I/O performance by migrating parallel
processes to the respective nodes that hold the data, as opposed
to the traditional way of bringing the file’s data to the
processes. The paper describes the MOSIX infrastructure
for supporting parallel file operations and the functions of
MOPI. It then presents the performance of MOPI for some
data intensive applications and its scalability.
Cluster computing, MOSIX, parallel I/O, scalable I/O.
Cluster computing has been traditionally associated with
high performance computing of massive CPU bound ap-
plications. Clustered systems can offer other advantages
for “demanding” applications, such as the ability to sup-
port large processes that span across the main memory of
several nodes, or cluster file systems that support parallel
access to files [3, 8, 10]. Recently, there has been a growing demand
to run data intensive applications that can process
several orders of magnitude faster than any existing single
computer. This paper presents a cluster based parallel I/O
system that could be useful for such applications.
Some packages for parallel I/O in clusters are the
Global File System (GFS), the Parallel Virtual File
System (PVFS), Panda, DataCutter and Abacus.
GFS is a shared file system suitable for clusters with
shared storage; it enables clients to access the same shared
storage while keeping the data consistent. Its main draw-
back is the high hardware costs. Its scalability is yet un-
PVFS is designed as a client-server system in which
files are transparently striped across disks of multiple file
servers. PVFS provides good performance but its main
drawbacks are its limited scalability (possibly due to the
Copyright (c) 2002 Amnon Barak. All rights reserved.
use of the network for almost all I/O operations) and lack
same file. PVFS is particularly popular in Linux clusters
for running I/O-intensive parallel processes.
Panda is a parallel I/O library for multidimensional
arrays that supports the strategy of “part-time I/O”,
where each node can be both a client and/or a server, simi-
which optimize the network usage by migrating processes
closer to their data, Panda optimizes the performance by
initially selecting the I/O nodes (among the cluster nodes)
which will make the cluster use the minimal network re-
sources for a given description of anticipated I/O requests,
from clients to servers.
DataCutter is a framework designed for developing
data intensive applications in distributed environments.
The programming model in DataCutter, called
“filter-stream programming”, represents components of
data-intensive applications as a set of filters. Each filter can
potentially be executed on a different host across a wide-
area network. Data exchange between any two filters is
described via streams, which are uni-directional pipes that
deliver data in fixed size buffers. The idea of changing the
placement of program components to better use the cluster
resources is similar to MOSIX. However, in MOSIX there
is no need to change the applications, and MOSIX incor-
porates automatic load balancing, whereas in DataCutter
assignments are manual.
Abacus is a programming model and a run time
system that monitors and dynamically changes function
placement for applications that manipulate large data sets.
The programming model encourages programmers to compose
data-intensive applications from explicitly-migratable
functions, while the run time system monitors those func-
tions and dynamically migrates them in order to im-
prove the application performance. This project is sim-
ilar to MOSIX: both perform dynamic load balancing of
processes/functions based on run-time statistics-collection.
The difference is that MOSIX works with generic programs
whereas Abacus requires a programming model.
This paper presents a new paradigm for cluster high
I/O performance that combines data partitioning with dynamic
work distribution. The target cluster consists of multiple,
homogeneous workstations and servers (nodes) that
work cooperatively, making the cluster-wide resources ac-
cessible to processes on all nodes. The cluster file sys-
tem consists of several subtrees that are placed in different
nodes to allow parallel operations on different files. The
key feature of our cluster I/O system is the ability to bring
(migrate) the process to the file server(s) rather than the traditional
way of bringing the file’s data to the process. Process
migration, which is already supported by MOSIX
for load-balancing, creates new, scalable, parallel I/O capabilities
that are suitable for applications that need to process
large volumes of data. We describe the MOSIX Parallel
I/O (MOPI) library, which provides means for (transparent)
partitioning of files to different nodes and enables
parallel access to different segments of a file.
The organization of the paper is as follows: Section 2
gives a short overview of MOSIX and its relevant features
for supporting parallel I/O. Section 3 presents MOPI and
Section 4 presents its performance. Our conclusions are
given in Section 5.
2 MOSIX Background
MOSIX is a software package that was specifically designed to
enhance the Unix kernel with cluster computing capabilities.
The core of MOSIX consists of adaptive algorithms for
load-balancing and memory ushering, which monitor uneven
resource distribution among the nodes and use preemptive
process migration to automatically reassign processes
among the nodes (like in an SMP) in order to continuously
take advantage of the best available resources. The
MOSIX algorithms are geared for maximal overall performance,
overhead-free scalability and ease-of-use.
2.1 The system image model
The granularity of the work distribution in MOSIX is the
Unix process. In MOSIX, each process has a unique home-
of the user. The system image model is a computing cluster,
in which every process seems to run at its home-node
and all the processes of a user’s session share the running
environment of the home-node. Processes that migrate to a
remote (away from the home) node use local (in the remote
node) resources whenever possible but continue to interact
with the user’s environment by forwarding environment
dependent system-calls to the home-node.
2.2 Process migration
MOSIX supports preemptive (completely transparent) process
migration, which can migrate almost any process, at any
time, to any available node. After a process is migrated, all
its system-calls are intercepted by a link layer at the remote
node. If a system-call is site independent, it runs on the re-
mote node. Otherwise, the system-call is forwarded to the
home-node, where it runs on behalf of the process.
The above scheme is particularly useful for CPU-
bound processes but inefficient for processes with inten-
sive I/O and/or file operations. This is due to the fact that
such processes are required to communicate with their re-
spective home-node environment for each I/O operation.
Clearly, in such cases these processes would be better off
not migrating. The next section describes a mechanism to
overcome this last problem.
2.3 The MOSIX File System (MFS)
We implemented the MOSIX File System (MFS), which
provides a unified view of all files on all mounted file sys-
tems (of any type) in all the nodes of a MOSIX cluster, as
if they were all within a single file system. For example,
if one mounts MFS on the /mfs mount-point, then the
file /mfs/1456/usr/tmp/myfile refers to the file
/usr/tmp/myfile on node #1456. MFS is scalable
because the number of MOSIX nodes is practically unlim-
ited. In MFS, each node in a MOSIX cluster can simul-
taneously be a file server and run processes as clients and
each process can work with any mounted file system of any
type. One advantage of the MFS approach is that it raises
the client-server interaction to the system-call level,
which provides total consistency.
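The naming convention above can be illustrated with a short Python sketch. The helper name split_mfs_path and its default mount point are assumptions made for this illustration; the sketch only mirrors the /mfs/<node>/<path> convention described in the text and is not part of MFS.

```python
from typing import Tuple


def split_mfs_path(path: str, mount: str = "/mfs") -> Tuple[int, str]:
    """Split an MFS-style path into (node number, path on that node).

    Hypothetical helper: it only mirrors the naming convention in the text.
    """
    prefix = mount.rstrip("/") + "/"
    if not path.startswith(prefix):
        raise ValueError(f"{path!r} is not under {mount!r}")
    # The first component after the mount point is the node number.
    node, _, rest = path[len(prefix):].partition("/")
    if not node.isdigit():
        raise ValueError(f"{node!r} is not a node number")
    return int(node), "/" + rest


node, local_path = split_mfs_path("/mfs/1456/usr/tmp/myfile")
print(node, local_path)
```

For the example path from the text, the helper yields node 1456 and the local path /usr/tmp/myfile.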
2.4 Direct File System Access (DFSA)
DFSA is a re-routing mechanism that reduces the extra
overhead of running I/O oriented system-calls of a migrated
process. This is done by allowing most such system-calls
to perform locally, in the process’s current node. In
addition to DFSA, MOSIX monitors the I/O operations of
each process in order to encourage a process that performs
a moderate to high volume of I/O to migrate to the node in
which it does most of its I/O. One obvious advantage is that
flexibility to migrate from their respective home-nodes for
2.5 Bringing the process to the file
Unlike most network file systems, which bring the data
from the file server to the client node over the network,
the MOSIX algorithms attempt to migrate the process to
the node in which the file resides. Usually most file opera-
tions are performed by a process on a single partition. The
MOSIX scheme has significant advantages over other net-
worked file systems as it allows a process to migrate and
use any local partition. Clearly this eliminates the commu-
nication overhead between the process and the file server
(except the cost of the process migration itself). We note
that the process migration algorithms monitor and weigh
the amount of I/O operations vs. the size of each process,
in an attempt to optimize the decision whether to migrate
the process or not.
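The trade-off between the amount of I/O and the process size can be sketched as a simple cost comparison. This is not the MOSIX algorithm, whose details are not given here; the function name, the cost model and the parameters are assumptions chosen only to show why a small process with heavy I/O is a good migration candidate.

```python
def should_migrate(process_size_mb: float,
                   expected_io_mb: float,
                   network_mb_per_sec: float) -> bool:
    """Return True if moving the process is cheaper than moving its data.

    Assumed cost model (not taken from MOSIX): the process image and the
    remote I/O traffic cross the same network link, so the cheaper choice
    is whichever transfers fewer bytes.
    """
    if network_mb_per_sec <= 0:
        raise ValueError("network rate must be positive")
    migration_cost = process_size_mb / network_mb_per_sec  # seconds
    remote_io_cost = expected_io_mb / network_mb_per_sec   # seconds
    return migration_cost < remote_io_cost


# A 64MB process about to read a 1GB segment: migrating is cheaper.
print(should_migrate(64, 1024, 11.17))
# A 512MB process that will read only 8MB: better to stay put.
print(should_migrate(512, 8, 11.17))
```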
3 The Design and Implementation of MOPI
This section presents the design and implementation of
MOPI, a library that provides means for parallel access to
files. MOPI supports partitioning of large files to indepen-
applications can access such files transparently, without
any knowledge of the partitioning scheme or the fact that
the files are partitioned. The MOPI library requests information
about a file from one or more dedicated servers,
called Meta Managers (MMs), that are responsible for managing
large files, e.g., allocating new segments, removing
segments, etc. The MM uses a general purpose daemon
(MOPID) for such service requests.
Below we describe the partitioning method of files,
then present the MOPI implementation.
3.1 The MOPI file structure
A MOPI file consists of 2 parts: a Meta Unit and Data Seg-
ments, as shown in Figure 1.
Figure 1. The MOPI file structure (DS size = 50MB; storage on nodes 1, 2 and 3)
3.1.1 The Meta Unit (MU)
The MU stores the file attributes, including the number and
ID of the nodes to which the file is partitioned; the size
of the partition unit (data segment) and the locations of the
segments. The functions and operations of the MU are similar
to the handling of i-nodes in UNIX file-systems; e.g., it
can be stored in a stable storage when the file is not active
(not accessed by any process) and it is loaded by the MM
when the file becomes active.
3.1.2 Data segments
Data is divided into Data Segments (DS). A DS is the
smallest unit of data for I/O optimization, e.g., when con-
sidering a migration of the process to the data. The seg-
ment size could vary from 1MB to almost 4GB, and all the
segments of the same file must be of the same size. Seg-
ments are created when the file is partitioned among the
nodes. After its creation, each DS exists as an autonomous
unit. When created, each segment is assigned to a node using
a round robin scheme. We note that other partitioning
schemes could be implemented, e.g., a random or a space
conserving scheme, which balances the use of disk space
among the nodes. Also note that as in regular UNIX files,
gaps (holes) might exist in the files. In that case reading
from a hole returns 0’s.
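The round robin placement and the handling of holes can be sketched as follows. The function names and the in-memory dictionary of segments are assumptions made for this illustration; the 50MB segment size in the usage note is the one shown in Figure 1.

```python
from typing import Dict, List, Tuple

MB = 1024 * 1024


def locate(offset: int, segment_size: int,
           nodes: List[int]) -> Tuple[int, int, int]:
    """Map a byte offset to (segment index, node, offset inside the segment).

    Assumes segment i is assigned to nodes[i % len(nodes)] (round robin).
    """
    if offset < 0 or segment_size <= 0 or not nodes:
        raise ValueError("invalid arguments")
    index, inner = divmod(offset, segment_size)
    return index, nodes[index % len(nodes)], inner


def read_bytes(segments: Dict[int, bytes], index: int, inner: int,
               length: int, segment_size: int) -> bytes:
    """Read from one segment; missing data (a hole) is returned as zeros."""
    length = min(length, segment_size - inner)  # stay inside the segment
    data = segments.get(index, b"")[inner:inner + length]
    return data + b"\0" * (length - len(data))


print(locate(120 * MB, 50 * MB, [1, 2, 3]))
print(read_bytes({0: b"abcdef"}, 1, 0, 4, segment_size=8))
```

With 50MB segments on nodes 1, 2 and 3, byte offset 120MB falls 20MB into segment 2, which the round robin assignment places on node 3. Reading from a segment that was never written returns zero bytes, as for holes in regular UNIX files.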
3.2 The MOPI implementation
The prototype MOPI implementation consists of three
parts: a user level library that is linked with the user ap-
plication; a set of daemons for managing meta-data and
an optional set of utilities to manage basic operations on
MOPI files from the shell. See Figure 2 for details.
Figure 2. The MOPI components
3.2.1 The MOPI interface
In order to use MOPI, the application should be modified to
use the MOPI functions, instead of the regular file system
The MOPI interface includes all the common UNIX
file access and manipulation functions, such as mopi_open,
mopi_close, mopi_write, mopi_read,
mopi_lseek, mopi_fcntl, mopi_stat, etc.
We note that mopi_readahead is an asynchronous
function that returns immediately, without finishing the actual
read. It can be used to overlap computation with I/O
as well as to pre-load parts of files to the main memory of
nodes, then to migrate processes to the respective nodes to
benefit from parallel, local access to data.
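The idea of overlapping computation with I/O through an asynchronous read-ahead can be sketched in Python. The sketch uses a standard thread pool as a stand-in for the asynchronous call; it does not use the MOPI library, and the names process_with_readahead, read_chunk and work are assumptions made for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def process_with_readahead(read_chunk: Callable[[int], bytes],
                           n_chunks: int,
                           work: Callable[[bytes], int]) -> List[int]:
    """Process chunks 0..n_chunks-1 while the next chunk is read in the background."""
    results: List[int] = []
    if n_chunks <= 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_chunk, 0)  # start the first read
        for i in range(n_chunks):
            data = pending.result()           # wait for chunk i
            if i + 1 < n_chunks:
                # Submitting returns immediately; the read runs in the background.
                pending = pool.submit(read_chunk, i + 1)
            results.append(work(data))        # overlaps with the pending read
    return results


chunks = [b"aa", b"bbbb", b"c"]
print(process_with_readahead(lambda i: chunks[i], len(chunks), len))
```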
3.2.2 Configuration and data access
The MOPI installation consists of allocating disk
space in one or more of the nodes (allocating a dedicated
partition is recommended); editing configuration files,
which define the nodes having the data and the access path
to the MOPI files; and starting a number of daemons, e.g.,
MM and MOPID.
A client process wishing to access a MOPI file must
first open the file by sending a request to an MM. The
MM processes the open request and loads the MU. The
process then requests the segment location(s), accesses the
segment(s) directly via the MFS file system and may migrate
to the node(s) that hold the segment(s) if the MOSIX
algorithms decide to do so.
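The access sequence above (open through an MM, load the MU, ask for a segment location, then reach the segment through MFS) can be sketched with a toy in-memory model. The class names, the field names, the /mopi storage directory and the name.index naming of segment files are all assumptions made for this illustration; the actual MM protocol and on-disk layout are not described here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MetaUnit:
    """File attributes kept by the MM (field names are illustrative)."""
    nodes: List[int]      # nodes holding the segments
    segment_size: int     # bytes per data segment
    root: str = "/mopi"   # assumed storage directory on each node


class MetaManager:
    """Toy in-memory stand-in for an MM."""

    def __init__(self, stored: Dict[str, MetaUnit]) -> None:
        self._stored = stored                    # MUs kept in stable storage
        self._active: Dict[str, MetaUnit] = {}   # MUs of files currently open

    def open(self, name: str) -> MetaUnit:
        # Load the MU when the file becomes active.
        if name not in self._active:
            self._active[name] = self._stored[name]
        return self._active[name]

    def segment_path(self, name: str, index: int) -> str:
        # Round robin placement, exposed through an MFS-style path.
        mu = self._active[name]
        node = mu.nodes[index % len(mu.nodes)]
        return f"/mfs/{node}{mu.root}/{name}.{index}"


mm = MetaManager({"big.dat": MetaUnit(nodes=[1, 2, 3], segment_size=50 * 2**20)})
mu = mm.open("big.dat")
print(mm.segment_path("big.dat", 4))
```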
3.3 MOPI support for MPI
MOPI supports the MPI-IO standard. An application
using the MPI-IO interface can run on top of MOPI without
any further modifications. In particular, we extended
the ROMIO implementation of the MPI-IO standard,
which includes support for NFS, UFS, PVFS, XFS, PFS,
SFS, PIOFS and HFS. In the next section we present the
performance of a benchmark which is written in MPI.
4 Performance of Parallel I/O with MOPI
This section presents the performance of MOPI using
benchmarks that simulate heavy file system loads, parallel
file system stress and tests to determine the optimal seg-
identical workstations, each with a Pentium III 1133MHz,
512MB RAM, a 20GB 7200 RPM IDE disk and a 100
the presented results reflect the average of 5 measurements.
For reference, we used the hdparm utility of Linux
and measured the average read rate of the disks at
38MB/Sec. We then used the Bonnie benchmark
and measured the throughput of the (ext2) file system at
36.3MB/Sec for read and 41.89 MB/Sec for write. We note
that the write rate was higher than both of the above read
rates due to caching optimization. We also used the ttcp-
1.12 benchmark to measure the speed of TCP/IP between
pairs of nodes at an average rate of 11.17 MB/Sec for 8KB
blocks, with less than 0.5% variations for 4KB, 16KB and
4.1 Heavy file system loads
This benchmark demonstrates the performance of DFSA
with MFS vs. NFS and local file access. We executed
the PostMark benchmark, which simulates heavy file
system loads, e.g., as in a large Internet electronic mail
server. First, the benchmark creates a large pool of random
size text files, then it performs a sequence of transactions,
recursively, until a predefined workload is obtained. The
benchmark was performed between a pair of nodes using 3
different file access methods and block sizes, ranging from
1K byte to 64K bytes.
Table 1. File systems access times (Sec.); columns: block size, Local, MFS with DFSA, NFS
The results are presented in Table 1. The second col-
umn shows the Local times, when both the process and the
the MFS with DFSA times, i.e., the benchmark started in a
client node then migrated to the server node; and the fourth
column shows the NFS times from a client (in its MOSIX
home) node to a server node.
From the results in Table 1 it follows that on average
(for all block sizes) MFS with DFSA is only 16.6% slower
than Local, and more than 10.3 times (900%) faster than
NFS. We note that this last result motivated the develop-
ment of MOPI.
4.2 Local vs. MOSIX vs. remote read
This test presents the performance of MOPI using the
IOR parallel file system stress benchmark, developed
at LLNL. In this benchmark all the processes access the
same file using the MPI-IO interface. First, each process is
allocated to a different node (using MPI). Then each pro-
cess opens the file and seeks to a different segment. Then
one segment is written to a local disk, followed by a read
of another (remote) segment. This procedure is repeated
several times (a parameter), so that in each iteration each
process accesses a different segment and no two processes
access the same segment concurrently.
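One access pattern with the properties described above can be sketched as a rotation schedule. This is an assumed layout for illustration, not the actual IOR implementation: in iteration i, process p writes segment i*N + p and then reads the segment written by its neighbour, so that within each phase no two processes touch the same segment.

```python
from typing import List, Tuple


def schedule(n_procs: int, iterations: int) -> List[List[Tuple[int, int]]]:
    """For each iteration, list (write_segment, read_segment) per process.

    Assumed layout: iteration i uses its own block of n_procs segments;
    process p writes the segment with its own number and reads the one
    written by process (p + 1) mod n_procs.
    """
    if n_procs < 2:
        raise ValueError("need at least two processes for a remote read")
    plan = []
    for i in range(iterations):
        base = i * n_procs
        plan.append([(base + p, base + (p + 1) % n_procs)
                     for p in range(n_procs)])
    return plan


for step in schedule(4, 2):
    print(step)
```

With round robin placement over N nodes, segment i*N + p resides on node p, so each write is local and each read is remote.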
To test the performance of applications that scan
(read-once, e.g. filter) large amounts of data that is al-
ready present in different nodes, we ran one iteration of
the IOR benchmark (without the write phase). We mea-
sured the throughput rates (MB/Sec) of the read operation
when each node held one (1GB) data segment and ran one
process. The test was repeated for the following cases:
1. All-local: each process was created in the same node
where its data was located and accessed it locally.
2. Forced-migration all-local: each process was cre-
ated in a different node and was forced to migrate to
its respective data node as soon as it opened the file.
3. MOSIX: each process was created in a different node
and later migrated to its respective data node using the
MOSIX automatic process distribution algorithms.
4. All-remote: each process was created in a different
node and accessed all its data via the network.
The results of these tests (aggregate throughput vs.
number of nodes) are shown in Figure 3. Obviously, the
highest rate, with a weighted average of 35.15 MB/Sec per
node, was obtained for the all-local test. Note that this case
represents the throughput of the best possible static assignment,
in which each process accesses exactly one (local) segment.
Clearly, this performance could not be sustained if
such a process accessed segments in different nodes.
Figure 3. Maximal vs. MOSIX vs. remote read rates (aggregate throughput in MB/Sec vs. number of nodes, 4 to 60)
Just below the all-local are the forced-migration all-
local access results, which represent the theoretical maxi-
mal throughput of MOSIX with a single migration of each
process. The obtained weighted average throughput was
32.68 MB/Sec per node, only 7.5% slower than the all-local
throughput. The weighted average throughput of the
MOSIX test was 27.39 MB/Sec, about 19% slower than the
forced-migration all-local and 28% slower than the all-local
throughput. We note that the lower rates of MOSIX
are due to its stability algorithms, which prevent migra-
tion of short-lived processes. The lowest weighted average
throughput, of only 9.18 MB/Sec per node, was obtained
by the all-remote test, which represents disk striping over
the network. The results obtained were 28% slower than
the throughput of TCP/IP (shown for reference as a dotted
line) and almost 3 times (198%) slower than the weighted
average throughput of MOSIX.
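The relative slowdowns quoted above can be checked with a few lines of Python. The sketch assumes that "x% slower" means the ratio of the faster rate to the slower rate, minus one; this definition is an assumption, chosen because it yields values close to the quoted 7.5%, 19%, 28% and 198%.

```python
def percent_slower(fast: float, slow: float) -> float:
    """Slowdown of `slow` relative to `fast`, in percent (rate ratio minus one)."""
    return (fast / slow - 1.0) * 100.0


# Weighted average throughput per node (MB/Sec), as reported in the text.
rates = {"all-local": 35.15, "forced": 32.68, "MOSIX": 27.39, "all-remote": 9.18}

print(round(percent_slower(rates["all-local"], rates["forced"]), 1))
print(round(percent_slower(rates["forced"], rates["MOSIX"]), 1))
print(round(percent_slower(rates["all-local"], rates["MOSIX"]), 1))
print(round(percent_slower(rates["MOSIX"], rates["all-remote"]), 1))
```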
The respective maximal aggregate throughputs with
60 nodes were 2117 MB/Sec for the all-local test, 1945
MB/Sec for the forced-migration all-local, 1633 MB/Sec
for MOSIX and 525 MB/Sec for the all-remote tests.
Observe that due to the large capacity of our switch,
the results of all the tests show linear speedups. Clearly,
this would not be the case for the all-remote test, which may
saturate a network with several smaller switches, unlike the
MOSIX approach, which scales up without heavy use of the
network.
4.3 Write/read with several migrations
In this test we used the IOR benchmark (both write and
read) to measure the degradation of the I/O performance
of MOSIX due to several process migrations. First, we created
one process in each node. Then the test was conducted
with 1, 3 and 6 iterations, where in each iteration each pro-
cess wrote one (1GB) segment locally, then read another
segment from a remote node. Note that during the read
phase the MOSIX algorithms were expected to detect and
migrate each process to its respective data node.
Figure 4. I/O rates with several process migrations (aggregate throughput in MB/Sec vs. number of nodes, 4 to 60)
Figure 4 shows the results of this test (throughput
vs. number of nodes) with 1, 3 and 6 process migrations.
The respective weighted average throughputs were 24.68
MB/Sec, 23.79 MB/Sec and 22.46 MB/Sec, and the maximal
aggregated throughputs were 1477.64 MB/Sec, 1429.41
MB/Sec and 1335.34 MB/Sec respectively for 60 nodes.
From these results it follows that each process migration
resulted in a throughput loss of about 1.8% and aggregated
throughput loss of about 1.6% for 60 nodes.
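The text does not state which pairs of measurements were compared to obtain the per-migration losses. As a sketch, under the assumption that the runs with 1 and 3 migrations are compared (two additional migrations), the average relative loss per additional migration comes out near the quoted 1.8% and 1.6%.

```python
def loss_per_migration(rate_a: float, rate_b: float,
                       extra_migrations: int) -> float:
    """Average relative throughput loss (percent) per additional migration."""
    if extra_migrations <= 0:
        raise ValueError("extra_migrations must be positive")
    return (rate_a - rate_b) / rate_a * 100.0 / extra_migrations


# Reported values, keyed by the number of migrations.
per_node = {1: 24.68, 3: 23.79, 6: 22.46}         # MB/Sec, weighted averages
aggregate = {1: 1477.64, 3: 1429.41, 6: 1335.34}  # MB/Sec, 60 nodes

print(round(loss_per_migration(per_node[1], per_node[3], 2), 2))
print(round(loss_per_migration(aggregate[1], aggregate[3], 2), 2))
```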
4.4 Choice of the segment size
To help users select the optimal segment size, Figure 5
presents the I/O throughput rates for 16MB – 4GB segments,
using the same set of tests as in Section 4.2 and a 60
node cluster. The forced migration test obtained
maximal performance with a segment size of 128MB,
slightly lower than the aggregate rates of parallel read from
60 disks (shown as a dotted line for reference). Below that
are the MOSIX results, which show a steady increase up
to a segment size of 2GB. As expected, the MOSIX results
converge to the forced migration results when the segment
size is increased. Finally, just under the TCP/IP line (which
reflects parallel read from a client to 60 servers) are the
all-remote results, which show maximal performance for
64-128 MB segments.
Figure 5. I/O rates, different segment sizes, 60 nodes (aggregate throughput in MB/Sec vs. segment size in MByte)
5 Conclusions and Future Work
This paper presented a new paradigm for cluster high I/O
performance that combines data partitioning with dynamic
work distribution. Our scheme supports parallel access to
segments of data by moving the processes to the data, to
benefit from local access, as opposed to the traditional way
of bringing the data to the processes. Our scheme scales
up linearly and it does not saturate the network. With the
increase of the segment size its performance asymptotically
approaches that of local disk access.
The work described in this paper could be extended
in several directions. First, it is possible to add support for
segment replication, to allow parallel read of the same segment
by different processes in different nodes. Another option
is to support memory-only segments, by which a large
file is pre-loaded to the main memory of the cluster’s nodes,
thus allowing faster access to the data without any disk access.
Another extension could be to support GRID com-
justify process migration to the data node. Finally, it will be
interesting to extend MOPI to support shared storage facilities,
e.g. SAN or NAS, in which case the process migration
could be used to balance the load over the I/O channels of
We wish to thank Danny Braniss and Assaf Spanier for
their help. This research was supported in part by the Min-
istry of Defense and by a grant from Dr. and Mrs. Silver-
ston, Cambridge, UK.
 L. Amar, A. Barak, A. Eizenberg and A. Shiloh.
The MOSIX Scalable Cluster File Systems for Linux.
http://www.MOSIX.org, July 2000.
 K. Amiri, D. Petrou, G.R. Ganger and G.A. Gib-
son. Dynamic Function Placement for Data-intensive
Cluster Computing. Proc. USENIX Annual Technical
Conference, San Diego, CA, June 2000.
 T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson,
D.S. Roselli and R.Y. Wang. Serverless network
file systems. ACM TCS, 14(1), pp. 41-79, 1996.
 A. Barak and A. Braverman. Memory Ushering in a
Scalable Computing Cluster. Journal of Microprocessors
and Microsystems, 22(3-4), Aug. 1998.
 A. Barak, O. La’adan and A. Shiloh. Scalable Clus-
ter Computing with MOSIX for LINUX. Proc. 5-th
Annual Linux Expo, pp. 95-100, Atlanta, GA, 1999.
 M. Beynon, C. Chang, U. Catalyurek, T. Kurc, A.
Sussman, H. Andrade, R. Ferreira and J. Saltz. Pro-
cessing Large-Scale Multi-dimensional Data in Par-
allel and Distributed Environments. Parallel Comput-
ing, 28(5), pp. 827-859, 2002.
 P.H. Carns, W.B. Ligon III, R.B. Ross and R. Thakur.
PVFS: A Parallel File System For Linux Clusters.
Proc. 4-th Annual Linux Conference, pp. 317-327,
Atlanta, GA, 2000.
 Y. Cho, M. Winslett, M. Subramaniam, Y. Chen, S.
Kuo and K.E. Seamons. Exploiting Local Data in
Parallel Array I/O on a Practical Network of Work-
stations. Proc. 5-th Workshop on I/O in Parallel and
Distributed Systems, pp. 1-13, San Jose, CA, 1997.
 Global File System (GFS). http://www.sistina.com.
 J. Katcher. PostMark: A New File System Bench-
 MOSIX. http://www.MOSIX.org.
 P. Pacheco. Parallel Programming with MPI. Morgan
Kaufmann Pub. Inc., 1996.
 R. Thakur, W. Gropp and E. Lusk. On Implementing
MPI-IO Portably and with High Performance. Proc.
6-th Workshop on I/O in Parallel and Distributed Sys-
tems, pp. 23-32, 1999.
 The I/O Stress Benchmark Codes, ior mpiio (In-
terleaved or Random) benchmark. SIOP, LLNL.