The MOSIX Parallel I/O System for Scalable I/O Performance
Lior Amar, Amnon Barak and Amnon Shiloh
Institute of Computer Science
The Hebrew University of Jerusalem
Jerusalem, 91904 Israel
This paper presents the MOSIX Scalable Parallel In-
put/Output (MOPI) system that uses the process migration
capability of MOSIX for parallel access to segments of data
that are scattered among different nodes. MOSIX is a Unix
based, cluster operating system that already supports pre-
emptive process migration for load-balancing and memory
ushering. MOPI supports splitting ﬁles to several nodes. It
can deliver a high I/O performance by migrating parallel
processes to the respective nodes that hold the data, as op-
posed to the traditional way of bringing the ﬁle’s data to the
processes. The paper describes the MOSIX infrastructure
for supporting parallel ﬁle operations and the functions of
MOPI. It then presents the performance of MOPI for some
data-intensive applications and its scalability.
Cluster computing, MOSIX, parallel I/O, scalable I/O.
Cluster computing has been traditionally associated with
high performance computing of massive CPU bound ap-
plications. Clustered systems can offer other advantages
for “demanding” applications, such as the ability to sup-
port large processes that span across the main memory of
several nodes, or cluster ﬁle systems that support parallel
access to files [3, 8, 10]. Recently, there has been a growing
demand to run data-intensive applications that process data
several orders of magnitude faster than any existing single
computer can. This paper presents a cluster-based parallel I/O
system that could be useful for such applications.
Some packages for parallel I/O in clusters are the
Global File System (GFS), the Parallel Virtual File
System (PVFS), Panda, DataCutter and Abacus.
GFS is a shared file system suitable for clusters with
shared storage. It enables clients to access the same shared
storage while keeping the data consistent. Its main drawback
is the high hardware cost. Its scalability is yet un-
PVFS is designed as a client-server system in which
files are transparently striped across the disks of multiple
file-servers. PVFS provides good performance but its main
drawbacks are its limited scalability (possibly due to the
use of the network for almost all I/O operations) and lack
of data consistency when several processes interact with the
same file. PVFS is particularly popular in Linux clusters
for running I/O-intensive parallel processes.
Copyright (c) 2002 Amnon Barak. All rights reserved.
Panda is a parallel I/O library for multidimensional
arrays that supports the strategy of “part-time I/O”,
where each node can be a client, a server, or both, similarly
to the MOSIX MFS scheme. However, unlike MOSIX,
which optimizes the network usage by migrating processes
closer to their data, Panda optimizes the performance by
initially selecting the I/O nodes (among the cluster nodes)
that make the cluster use minimal network resources
for a given description of anticipated I/O requests
from clients to servers.
DataCutter is a framework designed for developing
data-intensive applications in distributed environments.
The programming model in DataCutter, called
“ﬁlter-stream programming”, represents components of
data-intensive applications as a set of ﬁlters. Each ﬁlter can
potentially be executed on a different host across a wide-
area network. Data exchange between any two ﬁlters is
described via streams, which are uni-directional pipes that
deliver data in ﬁxed size buffers. The idea of changing the
placement of program components to better use the cluster
resources is similar to MOSIX. However, in MOSIX there
is no need to change the applications, and MOSIX incor-
porates automatic load balancing, whereas in DataCutter
assignments are manual.
Abacus is a programming model and a run time
system that monitors and dynamically changes function
placement for applications that manipulate large data sets.
The programming model encourages programmers to com-
pose data-intensive applications from explicitly-migratable
functions, while the run time system monitors those func-
tions and dynamically migrates them in order to im-
prove the application performance. This project is sim-
ilar to MOSIX: both perform dynamic load balancing of
processes/functions based on run-time statistics-collection.
The difference is that MOSIX works with generic programs
whereas Abacus requires a programming model.
This paper presents a new paradigm for cluster high
I/O performance that combines data partitioning with dy-
namic work distribution. The target cluster consists of mul-
tiple, homogeneous workstations and servers (nodes) that
work cooperatively, making the cluster-wide resources ac-
cessible to processes on all nodes. The cluster ﬁle sys-
tem consists of several subtrees that are placed in different
nodes to allow parallel operations on different ﬁles. The
key feature of our cluster I/O system is the ability to bring
(migrate) the process to the file server(s) rather than the
traditional way of bringing the file’s data to the process.
Process migration, which is already supported by MOSIX
for load-balancing, creates new, scalable, parallel I/O
capabilities that are suitable for applications that need to
process large volumes of data. We describe the MOSIX
Parallel I/O (MOPI) library, which provides means for
(transparent) partitioning of files to different nodes and enables
parallel access to different segments of a file.
The organization of the paper is as follows: Section 2
gives a short overview of MOSIX and its relevant features
for supporting parallel I/O. Section 3 presents MOPI and
Section 4 presents its performance. Our conclusions are
given in Section 5.
2 MOSIX Background
MOSIX is a software package that was specifically designed to
enhance the Unix kernel with cluster computing capabilities.
The core of MOSIX consists of adaptive algorithms for
load-balancing and memory ushering that monitor uneven
resource distribution among the nodes and use preemptive
process migration to automatically reassign processes
among the nodes (as in an SMP) in order to continuously
take advantage of the best available resources. The
MOSIX algorithms are geared for maximal overall performance,
overhead-free scalability and ease-of-use.
2.1 The system image model
The granularity of the work distribution in MOSIX is the
Unix process. In MOSIX, each process has a unique home-
node (where it was created), which is usually the login node
of the user. The system image model is a computing clus-
ter, in which every process seems to run at its home-node
and all the processes of a user’s session share the running
environment of the home-node. Processes that migrate to a
remote (away from the home) node use local (in the remote
node) resources whenever possible but continue to inter-
act with the user’s environment by forwarding environment
dependent system-calls to the home-node.
2.2 Process migration
MOSIX supports preemptive (completely transparent) pro-
cess migration, that can migrate almost any process, any
time, to any available node. After a process is migrated all
its system-calls are intercepted by a link layer at the remote
node. If a system-call is site independent, it runs on the re-
mote node. Otherwise, the system-call is forwarded to the
home-node, where it runs on behalf of the process.
The above scheme is particularly useful for CPU-
bound processes but inefﬁcient for processes with inten-
sive I/O and/or ﬁle operations. This is due to the fact that
such processes are required to communicate with their re-
spective home-node environment for each I/O operation.
Clearly, in such cases these processes would be better off
not migrating. The next section describes a mechanism to
overcome this last problem.
2.3 The MOSIX File System (MFS)
We implemented the MOSIX File System (MFS), which
provides a uniﬁed view of all ﬁles on all mounted ﬁle sys-
tems (of any type) in all the nodes of a MOSIX cluster, as
if they were all within a single ﬁle system. For example,
if one mounts MFS on the /mfs mount-point, then the
ﬁle /mfs/1456/usr/tmp/myfile refers to the ﬁle
/usr/tmp/myfile on node #1456. MFS is scalable
because the number of MOSIX nodes is practically unlim-
ited. In MFS, each node in a MOSIX cluster can simul-
taneously be a ﬁle server and run processes as clients and
each process can work with any mounted ﬁle system of any
type. One advantage of the MFS approach is that it raises
the client-server interaction to the system-call level,
which provides total consistency.
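The /mfs naming convention described above can be illustrated with a
minimal Python sketch. The helper name split_mfs_path and its
mount_point argument are hypothetical, chosen only for illustration;
they are not part of MOSIX, and the code only parses the path, it does
not contact any node.

```python
from pathlib import PurePosixPath


def split_mfs_path(path, mount_point="/mfs"):
    """Split an MFS path into (node number, path on that node).

    Hypothetical helper: it mirrors the naming convention
    <mount_point>/<node>/<path> and nothing else.
    """
    # relative_to() raises ValueError if path is not under mount_point.
    parts = PurePosixPath(path).relative_to(mount_point).parts
    if not parts or not parts[0].isdigit():
        raise ValueError(f"no node number in {path!r}")
    node = int(parts[0])
    # Re-root the remainder as an absolute path on the target node.
    local = PurePosixPath("/").joinpath(*parts[1:])
    return node, str(local)


print(split_mfs_path("/mfs/1456/usr/tmp/myfile"))
```

For the example in the text, the node component is 1456 and the
remaining path is /usr/tmp/myfile.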
2.4 Direct File System Access (DFSA)
DFSA is a re-routing mechanism that reduces the extra
overhead of running I/O oriented system-calls of a migrated
process. This is done by allowing most such system-calls
to be performed locally, in the process’s current node. In
addition to DFSA, MOSIX monitors the I/O operations of
each process in order to encourage a process that performs
moderate to high volume of I/O to migrate to the node in
which it does most of its I/O. One obvious advantage is that
I/O-bound (and mixed I/O and CPU) processes have greater
ﬂexibility to migrate from their respective home-nodes for
2.5 Bringing the process to the ﬁle
Unlike most network ﬁle systems, which bring the data
from the ﬁle server to the client node over the network,
the MOSIX algorithms attempt to migrate the process to
the node in which the ﬁle resides. Usually most ﬁle opera-
tions are performed by a process on a single partition. The
MOSIX scheme has signiﬁcant advantages over other net-
worked ﬁle systems as it allows a process to migrate and
use any local partition. Clearly this eliminates the commu-
nication overhead between the process and the ﬁle server
(except the cost of the process migration itself). We note
that the process migration algorithms monitor and weigh
the amount of I/O operations vs. the size of each process,
in an attempt to optimize the decision whether to migrate
the process or not.
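The trade-off between I/O volume and process size can be sketched with
a simple cost model. This is only an illustration under stated
assumptions, not the MOSIX algorithm: the function name should_migrate
is hypothetical, staying is assumed to cost the remaining I/O volume
sent over the network, and migrating is assumed to cost the process
size sent over the network plus the same I/O done from a local disk.
The default rates are the TCP/IP and disk read rates reported in
Section 4.

```python
def should_migrate(process_size_mb, expected_io_mb,
                   network_mb_per_sec=11.17, disk_mb_per_sec=38.0):
    """Return True if moving the process is estimated to be cheaper
    than moving its data (illustrative cost model, assumed)."""
    # Staying: all remaining I/O crosses the network.
    stay_cost = expected_io_mb / network_mb_per_sec
    # Migrating: ship the process once, then read from the local disk.
    migrate_cost = (process_size_mb / network_mb_per_sec
                    + expected_io_mb / disk_mb_per_sec)
    return migrate_cost < stay_cost


# A 50 MB process that still has to read a 1 GB segment.
print(should_migrate(50, 1024))
# The same process with only 2 MB of I/O left.
print(should_migrate(50, 2))
```

Under this model a large remaining I/O volume favors migration, while a
small one does not justify moving the process.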
3 The Design and Implementation of MOPI
This section presents the design and implementation of
MOPI, a library that provides means for parallel access to
ﬁles. MOPI supports partitioning of large ﬁles to indepen-
dent data segments that are placed in different nodes. Client
applications can access such ﬁles transparently, without
any knowledge of the partitioning scheme or the fact that
the files are partitioned. The MOPI library requests
information about a file from one or more dedicated servers,
called Meta Managers (MM), that are responsible for managing
large files, e.g., allocating new segments, removing
segments, etc. The MM uses a general purpose daemon
(MOPID) for such service requests.
Below we describe the partitioning method of ﬁles,
then present the MOPI implementation.
3.1 The MOPI ﬁle structure
A MOPI ﬁle consists of 2 parts: a Meta Unit and Data Seg-
ments, as shown in Figure 1.
Figure 1. The MOPI file structure (data segments with DS Size = 50MB, in storage on Node 1, Node 2 and Node 3)
3.1.1 The Meta Unit (MU)
The MU stores the ﬁle attributes, including the number and
ID of the nodes to which the ﬁle is partitioned; the size
of the partition unit (data segment) and the locations of the
segments. The functions and operations of the MU are sim-
ilar to the handling of i-nodes in UNIX ﬁle-systems; e.g., it
can be stored in a stable storage when the ﬁle is not active
(not accessed by any process) and it is loaded by the MM
when the ﬁle becomes active.
3.1.2 Data segments
Data is divided into Data Segments (DS). A DS is the
smallest unit of data for I/O optimization, e.g., when con-
sidering a migration of the process to the data. The seg-
ment size could vary from 1MB to almost 4GB, and all the
segments of the same ﬁle must be of the same size. Seg-
ments are created when the ﬁle is partitioned among the
nodes. After its creation, each DS exists as an autonomous
unit. When created, each segment is assigned to a node us-
ing a round robin scheme. We note that other partitioning
schemes could be implemented, e.g., a random scheme or a
space-conserving scheme that balances the use of disk space
among the nodes. Also note that as in regular UNIX ﬁles,
gaps (holes) might exist in the ﬁles. In that case reading
from a hole returns 0’s.
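The mapping from a file offset to a data segment and its node can be
sketched as follows. This is a minimal illustration, assuming that
segments are assigned round robin in index order starting with the
first node in the list; the function name locate is hypothetical and
is not part of the MOPI library.

```python
def locate(offset, segment_size, nodes):
    """Map a byte offset in a partitioned file to
    (segment index, node, offset inside the segment).

    Assumes fixed-size segments assigned to `nodes` round robin,
    in segment-index order (an assumption of this sketch).
    """
    if offset < 0 or segment_size <= 0 or not nodes:
        raise ValueError("invalid offset, segment size or node list")
    index, inner = divmod(offset, segment_size)
    return index, nodes[index % len(nodes)], inner


MB = 1024 * 1024
# 50 MB segments over three nodes, as in Figure 1.
print(locate(120 * MB, 50 * MB, [1, 2, 3]))
```

With 50MB segments on three nodes, byte offset 120MB falls 20MB into
the third segment, which this scheme places on node 3.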
3.2 The MOPI implementation
The prototype MOPI implementation consists of three
parts: a user level library that is linked with the user ap-
plication; a set of daemons for managing meta-data and
an optional set of utilities to manage basic operations on
MOPI ﬁles from the shell. See Figure 2 for details.
Figure 2. The MOPI components (each of Node 1 and Node 2 has a native filesystem)
3.2.1 The MOPI interface
In order to use MOPI, the application should be modified to
use the MOPI functions instead of the regular file system
calls. The MOPI interface includes all the common UNIX
file access and manipulation functions, such as mopi_open,
mopi_close, mopi_write, mopi_read, mopi_readahead,
mopi_lseek, mopi_fcntl, mopi_stat, etc.
We note that mopi_readahead is an asynchronous
function that returns immediately, without finishing the
actual read. It can be used to overlap computation with I/O,
as well as to pre-load parts of files into the main memory of
nodes and then migrate processes to the respective nodes to
benefit from parallel, local access to data.
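The idea of overlapping computation with I/O through an asynchronous
read-ahead can be illustrated with a generic Python sketch. It does not
use the MOPI interface, whose signatures are not given here; instead it
assumes a background thread as a stand-in for the asynchronous read,
and the function name crc32_with_prefetch is hypothetical.

```python
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor


def crc32_with_prefetch(path, chunk_size=1 << 20):
    """CRC32 of a file, reading chunk i+1 in a background thread
    while chunk i is being processed."""
    crc = 0
    with open(path, "rb") as f, ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(f.read, chunk_size)      # start the first read
        while True:
            chunk = pending.result()                   # wait for the data
            if not chunk:                              # empty read: end of file
                break
            pending = pool.submit(f.read, chunk_size)  # issue the next read early
            crc = zlib.crc32(chunk, crc)               # compute in the meantime
    return crc


# Small self-check on a temporary file.
data = b"mopi" * 1_000_000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
actual = crc32_with_prefetch(tmp.name)
os.remove(tmp.name)
expected = zlib.crc32(data)
print(actual == expected)
```

Only one read is outstanding at any time, so the chunks are still
consumed in file order.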
3.2.2 Conﬁguration and data access
Installing MOPI involves allocating disk space in one or
more of the nodes (a dedicated partition is recommended),
editing the configuration files that define the nodes holding
the data and the access path to the MOPI files, and starting
a number of daemons, e.g., MM and MOPID.
A client process wishing to access a MOPI file must
first open the file by sending a request to an MM. The
MM processes the open request and loads the MU. The
process then requests the segment location(s), accesses the
segment(s) directly via the MFS file system and may migrate
to the node(s) that hold the segment(s) if the MOSIX
algorithms decide to do so.
3.3 MOPI support for MPI
MOPI supports the MPI-IO standard. An application
using the MPI-IO interface can run on top of MOPI without
any further modifications. In particular, we extended
the ROMIO implementation of the MPI-IO standard,
which includes support for NFS, UFS, PVFS, XFS, PFS,
SFS, PIOFS and HFS. In the next section we present the
performance of a benchmark which is written in MPI.
4 Performance of Parallel I/O with MOPI
This section presents the performance of MOPI using
benchmarks that simulate heavy file system loads and parallel
file system stress, and tests to determine the optimal
segment size.
All measurements were performed in a cluster with 60
identical workstations, each with a Pentium III 1133MHz,
512MB RAM, a 20GB 7200 RPM IDE disk and a 100
Mb/Sec Ethernet NIC, that were connected by a 96X96 matrix
switch and ran under MOSIX for Linux 2.4.18. All
the presented results reﬂect the average of 5 measurements.
For reference, we used the hdparm utility of Linux
and measured the average read rate of the disks at
38MB/Sec. We then used the Bonnie benchmark
and measured the throughput of the (ext2) ﬁle system at
36.3MB/Sec for read and 41.89 MB/Sec for write. We note
that the write rate was higher than both of the above read
rates due to caching optimization. We also used the ttcp-
1.12 benchmark to measure the speed of TCP/IP between
pairs of nodes at an average rate of 11.17 MB/Sec for 8KB
blocks, with less than 0.5% variations for 4KB, 16KB and
4.1 Heavy ﬁle system loads
This benchmark demonstrates the performance of DFSA
with MFS vs. NFS and local file access. We executed
the PostMark benchmark, which simulates heavy file
system loads, e.g., as in a large Internet electronic mail
server. First, the benchmark creates a large pool of random
size text ﬁles, then it performs a sequence of transactions,
recursively, until a predeﬁned workload is obtained. The
benchmark was performed between a pair of nodes using 3
different ﬁle access methods and block sizes, ranging from
1K byte to 64K bytes.
Table 1. File system access times (Sec.)
Block size   Local   MFS with DFSA   NFS
1K           14.4    18.0            162.0
4K           13.2    15.4            162.0
8K           13.0    15.0            161.6
16K          13.0    15.0            162.0
32K          13.8    15.4            162.8
64K          14.2    16.4            163.0
The results are presented in Table 1. The second col-
umn shows the Local times, when both the process and the
ﬁles were in the same (server) node; the third column shows
the MFS with DFSA times, i.e., the benchmark started in a
client node then migrated to the server node; and the fourth
column shows the NFS times from a client (in its MOSIX
home) node to a server node.
From the results in Table 1 it follows that on average
(for all block sizes) MFS with DFSA is only 16.6% slower
than Local, and about 10.3 times (more than 900%) faster than
NFS. We note that this last result motivated the development
of MOPI.
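The averages quoted above can be reproduced from Table 1 with a short
Python sketch. It assumes that each average is the mean of the
per-block-size ratios, which is not stated explicitly in the text but
agrees with the reported figures.

```python
# Access times in seconds from Table 1: (Local, MFS with DFSA, NFS).
table1 = {
    "1K":  (14.4, 18.0, 162.0),
    "4K":  (13.2, 15.4, 162.0),
    "8K":  (13.0, 15.0, 161.6),
    "16K": (13.0, 15.0, 162.0),
    "32K": (13.8, 15.4, 162.8),
    "64K": (14.2, 16.4, 163.0),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Assumption: averages are means of the per-block-size time ratios.
mfs_vs_local = mean(mfs / local for local, mfs, nfs in table1.values())
nfs_vs_mfs = mean(nfs / mfs for local, mfs, nfs in table1.values())

print(f"MFS with DFSA vs. Local: {100 * (mfs_vs_local - 1):.1f}% slower")
print(f"MFS with DFSA vs. NFS:   {nfs_vs_mfs:.1f} times faster")
```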
4.2 Local vs. MOSIX vs. remote read
This test presents the performance of MOPI using the
IOR parallel file system stress benchmark, developed
at LLNL. In this benchmark all the processes access the
same ﬁle using the MPI-IO interface. First, each process is
allocated to a different node (using MPI). Then each pro-
cess opens the ﬁle and seeks to a different segment. Then
one segment is written to a local disk, followed by a read
of another (remote) segment. This procedure is repeated
several times (a parameter), so that in each iteration each
process accesses a different segment and no two processes
access the same segment concurrently.
To test the performance of applications that scan
(read-once, e.g. filter) large amounts of data that are
already present in different nodes, we ran one iteration of
the IOR benchmark (without the write phase). We measured
the throughput rates (MB/Sec) of the read operation
when each node held one (1GB) data segment and ran one
process. The test was repeated for the following cases:
1. All-local: each process was created in the same node
where its data was located and accessed it locally.
2. Forced-migration all-local: each process was cre-
ated in a different node and was forced to migrate to
its respective data node as soon as it opened the ﬁle.
3. MOSIX: each process was created in a different node
and later migrated to its respective data node using the
MOSIX automatic process distribution algorithms.
4. All-remote: each process was created in a different
node and accessed all its data via the network.
The results of these tests (aggregate throughput vs.
number of nodes) are shown in Figure 3. Obviously, the
highest rate, with a weighted average of 35.15 MB/Sec per
node, was obtained for the all-local test. Note that this case
represents the throughput of the best possible static assignment,
in which each process accesses exactly one (local) segment.
Clearly, this performance could not be sustained if
such a process accessed segments in different nodes.
Figure 3. Maximal vs. MOSIX vs. remote read rates (aggregate throughput in MB/Sec vs. number of nodes, 0 to 60; legend includes TCP/IP max bandwidth and forced-migration all-local)
Just below the all-local are the forced-migration all-
local access results, which represent the theoretical maxi-
mal throughput of MOSIX with a single migration of each
process. The obtained weighted average throughput was
32.68 MB/Sec per node, only 7.5% slower than the all-
local throughput. The weighted average throughput of the
MOSIX test was 27.39 MB/Sec, about 19% slower than the
forced-migration all-local and 28% slower than the all-
local throughput. We note that the lower rates of MOSIX
are due to its stability algorithms, which prevent migra-
tion of short-lived processes. The lowest weighted average
throughput, of only 9.18 MB/Sec per node, was obtained
by the all-remote test, which represents disk striping over
the network. The results obtained were 28% slower than
the throughput of TCP/IP (shown for reference as a dotted
line) and almost 3 times (198%) slower than the weighted
average throughput of MOSIX.
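The slowdown figures between the four read tests can be checked with a
short Python sketch. It assumes that "X% slower" expresses how much
higher the faster rate is relative to the slower one, a convention
inferred from the numbers rather than stated in the text; the helper
name percent_slower is hypothetical.

```python
# Weighted average read throughput per node (MB/Sec), from the text.
rates = {
    "all-local": 35.15,
    "forced-migration": 32.68,
    "MOSIX": 27.39,
    "all-remote": 9.18,
}


def percent_slower(slow, fast):
    """Assumed convention: excess of the faster rate over the slower
    rate, as a percentage of the slower rate."""
    return 100.0 * (rates[fast] / rates[slow] - 1.0)


pairs = [
    ("forced-migration", "all-local"),
    ("MOSIX", "forced-migration"),
    ("MOSIX", "all-local"),
    ("all-remote", "MOSIX"),
]
for slow, fast in pairs:
    print(f"{slow} vs. {fast}: {percent_slower(slow, fast):.2f}% slower")
```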
The respective maximal aggregate throughputs, with
60 nodes, were 2117 MB/Sec for the all-local test, 1945
MB/Sec for the forced-migration all-local test, 1633 MB/Sec
for the MOSIX test and 525 MB/Sec for the all-remote test.
Observe that due to the large capacity of our switch,
the results of all the tests show linear speedups. Clearly,
this would not be the case for the all-remote test in a
network with several smaller switches, which it may saturate,
unlike the MOSIX approach, which scales up without heavy use
of the network.
4.3 Write/read with several migrations
In this test we used the IOR benchmark (both write and
read) to measure the degradation of the I/O performance
of MOSIX due to several process migrations. First, we cre-
ated one process in each node. Then the test was conducted
with 1, 3 and 6 iterations, where in each iteration each pro-
cess wrote one (1GB) segment locally, then read another
segment from a remote node. Note that during the read
phase the MOSIX algorithms were expected to detect and
migrate each process to its respective data node.
Figure 4. I/O rates with several process migrations (aggregate throughput in MB/Sec vs. number of nodes, 0 to 60; TCP/IP max bandwidth shown for reference)
Figure 4 shows the results of this test (throughput
vs. number of nodes) with 1, 3 and 6 process migrations.
The respective weighted average throughputs were 24.68
MB/Sec, 23.79 MB/Sec and 22.46 MB/Sec per node, and the
respective maximal aggregate throughputs for 60 nodes were
1477.64 MB/Sec, 1429.41 MB/Sec and 1335.34 MB/Sec.
From these results it follows that each process migration
resulted in a throughput loss of about 1.8% and an aggregate
throughput loss of about 1.6% for 60 nodes.
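The per-migration loss in the weighted average throughput can be
reproduced with a short Python sketch. It assumes that the loss is
measured relative to the run with a single migration and divided by the
number of additional migrations, which is an interpretation of the text
rather than a stated method.

```python
# Weighted average throughput (MB/Sec) by number of process migrations.
rates = {1: 24.68, 3: 23.79, 6: 22.46}


def loss_per_migration(rates, baseline=1):
    """Percent throughput loss per additional migration, relative to
    the run with `baseline` migrations (assumed convention)."""
    base = rates[baseline]
    return {
        migrations: 100.0 * (base - rate) / base / (migrations - baseline)
        for migrations, rate in rates.items()
        if migrations != baseline
    }


losses = loss_per_migration(rates)
for migrations, loss in sorted(losses.items()):
    print(f"{migrations} migrations: {loss:.1f}% per additional migration")
```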
4.4 Choice of the segment size
To help users select the optimal segment size, Figure 5
presents the I/O throughput rates for 16MB – 4GB seg-
ments, using the same set of tests as in Section 4.2 and a 60
node cluster. The forced-migration test reached its maximal
throughput with a segment size of 128 MB,
slightly lower than the aggregate rate of parallel reads from
60 disks (shown as a dotted line for reference). Below that
are the MOSIX results, which show a steady increase up
to a segment size of 2GB. As expected, the MOSIX results
converge to the forced-migration results when the segment
size is increased. Finally, just under the TCP/IP line (which
reflects parallel read from a client to 60 servers) are the
all-remote results, which show maximal performance for
64-128 MB segments.
Figure 5. I/O rates, different segment sizes, 60 nodes (aggregate throughput in MB/Sec vs. segment size, 16 to 4096 MByte; legend includes TCP/IP max bandwidth, disk max bandwidth and forced-migration all-local)
5 Conclusions and Future Work
This paper presented a new paradigm for cluster high I/O
performance, that combines data partitioning with dynamic
work distribution. Our scheme supports parallel access to
segments of data by moving the processes to the data, to
beneﬁt from local access, as opposed to the traditional way
of bringing the data to the processes. Our scheme scales
up linearly and it does not saturate the network. As the
segment size increases, its performance asymptotically
approaches that of local disk access.
The work described in this paper could be extended
in several directions. First, it is possible to add support for
segment replication, to allow parallel read of the same segment
by different processes in different nodes. Another option
is to support memory-only segments, by which a large
file is pre-loaded to the main memory of the cluster’s nodes,
thus allowing faster access to the data without any disk access.
Another extension could be to support GRID computing,
in which unpredictable communication patterns
justify process migration to the data node. Finally, it will be
interesting to extend MOPI to support shared storage facili-
ties, e.g. SAN or NAS, in which case the process migration
could be used to balance the load over the I/O channels of
We wish to thank Danny Braniss and Assaf Spanier for
their help. This research was supported in part by the Min-
istry of Defense and by a grant from Dr. and Mrs. Silver-
ston, Cambridge, UK.
[1] L. Amar, A. Barak, A. Eizenberg and A. Shiloh.
The MOSIX Scalable Cluster File Systems for Linux.
http://www.MOSIX.org, July 2000.
[2] K. Amiri, D. Petrou, G.R. Ganger and G.A. Gibson.
Dynamic Function Placement for Data-intensive
Cluster Computing. Proc. USENIX Annual Technical
Conference, San Diego, CA, June 2000.
[3] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson,
D.S. Roselli and R.Y. Wang. Serverless network
file systems. ACM TCS, 14(1), pp. 41-79, 1996.
[4] A. Barak and A. Braverman. Memory Ushering in a
Scalable Computing Cluster. Journal of Microprocessors
and Microsystems, 22(3-4), Aug. 1998.
[5] A. Barak, O. La’adan and A. Shiloh. Scalable Cluster
Computing with MOSIX for LINUX. Proc. 5-th
Annual Linux Expo, pp. 95-100, Atlanta, GA, 1999.
[6] M. Beynon, C. Chang, U. Catalyurek, T. Kurc, A.
Sussman, H. Andrade, R. Ferreira and J. Saltz. Processing
Large-Scale Multi-dimensional Data in Parallel
and Distributed Environments. Parallel Computing,
28(5), pp. 827-859, 2002.
[7] Bonnie File System Benchmark
[8] P.H. Carns, W.B. Ligon III, R.B. Ross and R. Thakur.
PVFS: A Parallel File System For Linux Clusters.
Proc. 4-th Annual Linux Conference, pp. 317-327,
Atlanta, GA, 2000.
[9] Y. Cho, M. Winslett, M. Subramaniam, Y. Chen, S.
Kuo and K.E. Seamons. Exploiting Local Data in
Parallel Array I/O on a Practical Network of Workstations.
Proc. 5-th Workshop on I/O in Parallel and
Distributed Systems, pp. 1-13, San Jose, CA, 1997.
[10] Global File System (GFS). http://www.sistina.com.
[11] J. Katcher. PostMark: A New File System Bench-
[12] MOSIX. http://www.MOSIX.org.
[13] P. Pacheco. Parallel Programming with MPI. Morgan
Kaufmann Pub. Inc., 1996.
[14] R. Thakur, W. Gropp and E. Lusk. On Implementing
MPI-IO Portably and with High Performance. Proc.
6-th Workshop on I/O in Parallel and Distributed Systems,
pp. 23-32, 1999.
[15] The I/O Stress Benchmark Codes, ior mpiio (Interleaved
or Random) benchmark. SIOP, LLNL.