The MOSIX Parallel I/O System for Scalable I/O Performance

Lior Amar, Amnon Barak and Amnon Shiloh
Institute of Computer Science
The Hebrew University of Jerusalem
Jerusalem, 91904 Israel
ABSTRACT
This paper presents the MOSIX Scalable Parallel In-
put/Output (MOPI) system that uses the process migration
capability of MOSIX for parallel access to segments of data
that are scattered among different nodes. MOSIX is a Unix-based cluster operating system that already supports preemptive process migration for load-balancing and memory ushering. MOPI supports splitting files across several nodes. It can deliver high I/O performance by migrating parallel
processes to the respective nodes that hold the data, as op-
posed to the traditional way of bringing the file’s data to the
processes. The paper describes the MOSIX infrastructure
for supporting parallel file operations and the functions of
MOPI. It then presents the performance of MOPI for some data-intensive applications and its scalability.
KEY WORDS
Cluster computing, MOSIX, parallel I/O, scalable I/O.
1 Introduction
Cluster computing has been traditionally associated with
high performance computing of massive CPU bound ap-
plications. Clustered systems can offer other advantages
for “demanding” applications, such as the ability to sup-
port large processes that span across the main memory of
several nodes, or cluster file systems that support parallel
access to files [3, 8, 10]. Recently, there has been a growing demand to run data-intensive applications that need to process data several orders of magnitude faster than any existing single computer. This paper presents a cluster-based parallel I/O system that could be useful for such applications.
Some packages for parallel I/O in clusters are the
Global File System (GFS) [10], the Parallel Virtual File
System (PVFS) [8], Panda [9], DataCutter [6] and Abacus [2].
GFS is a shared file system suitable for clusters with shared storage; it enables clients to access the same shared storage while keeping the data consistent. Its main drawback is the high hardware cost. Its scalability is yet unknown.
PVFS is designed as a client-server system in which files are transparently striped across the disks of multiple file servers. PVFS provides good performance, but its main drawbacks are its limited scalability (possibly due to the use of the network for almost all I/O operations) and a lack of data consistency when several processes interact with the same file. PVFS is particularly popular in Linux clusters for running I/O-intensive parallel processes.
Panda [9] is a parallel I/O library for multidimensional arrays that supports the strategy of “part-time I/O”, where each node can be a client, a server, or both, similarly to the MOSIX MFS scheme. However, unlike MOSIX, which optimizes network usage by migrating processes closer to their data, Panda optimizes performance by initially selecting the I/O nodes (among the cluster nodes) that minimize the network resources used for a given description of anticipated I/O requests from clients to servers.
DataCutter [6] is a framework designed for devel-
oping data intensive applications in distributed environ-
ments. The programming model in DataCutter, called
“filter-stream programming”, represents components of
data-intensive applications as a set of filters. Each filter can
potentially be executed on a different host across a wide-
area network. Data exchange between any two filters is
described via streams, which are uni-directional pipes that
deliver data in fixed size buffers. The idea of changing the
placement of program components to better use the cluster
resources is similar to MOSIX. However, in MOSIX there
is no need to change the applications, and MOSIX incor-
porates automatic load balancing, whereas in DataCutter
assignments are manual.
Abacus [2] is a programming model and a run time
system that monitors and dynamically changes function
placement for applications that manipulate large data sets.
The programming model encourages programmers to com-
pose data-intensive applications from explicitly-migratable
functions, while the run time system monitors those func-
tions and dynamically migrates them in order to im-
prove the application performance. This project is sim-
ilar to MOSIX: both perform dynamic load balancing of
processes/functions based on run-time statistics-collection.
The difference is that MOSIX works with generic programs
whereas Abacus requires a programming model.
This paper presents a new paradigm for cluster high
I/O performance that combines data partitioning with dy-
namic work distribution. The target cluster consists of mul-
tiple, homogeneous workstations and servers (nodes) that
work cooperatively, making the cluster-wide resources ac-
cessible to processes on all nodes. The cluster file sys-
tem consists of several subtrees that are placed in different
nodes to allow parallel operations on different files. The
key feature of our cluster I/O system is the ability to bring (migrate) the process to the file server(s), rather than the traditional way of bringing the file's data to the process. Process migration, which is already supported by MOSIX [12] for load-balancing, creates new, scalable, parallel I/O capabilities that are suitable for applications that need to process large volumes of data. We describe the MOSIX Parallel I/O (MOPI) library, which provides means for (transparent) partitioning of files across different nodes and enables parallel access to different segments of a file.
The organization of the paper is as follows: Section 2
gives a short overview of MOSIX and its relevant features
for supporting parallel I/O. Section 3 presents MOPI and
Section 4 presents its performance. Our conclusions are
given in Section 5.
2 MOSIX Background
MOSIX is a software package that was specifically designed to enhance the Unix kernel with cluster computing capabilities. The core of MOSIX consists of adaptive algorithms for load-balancing [5] and memory ushering [4], which monitor uneven resource distribution among the nodes and use preemptive process migration to automatically reassign processes among the nodes (as in an SMP) in order to continuously take advantage of the best available resources. The MOSIX algorithms are geared for maximal overall performance, overhead-free scalability and ease-of-use.
2.1 The system image model
The granularity of the work distribution in MOSIX is the
Unix process. In MOSIX, each process has a unique home-
node (where it was created), which is usually the login node
of the user. The system image model is that of a computing cluster in which every process seems to run at its home-node, and all the processes of a user's session share the running environment of the home-node. Processes that migrate to a remote (away from the home) node use local (in the remote node) resources whenever possible, but continue to interact with the user's environment by forwarding environment-dependent system-calls to the home-node.
2.2 Process migration
MOSIX supports preemptive (completely transparent) pro-
cess migration, that can migrate almost any process, any
time, to any available node. After a process is migrated all
its system-calls are intercepted by a link layer at the remote
node. If a system-call is site independent, it runs on the re-
mote node. Otherwise, the system-call is forwarded to the
home-node, where it runs on behalf of the process.
The above scheme is particularly useful for CPU-
bound processes but inefficient for processes with inten-
sive I/O and/or file operations, because such processes must communicate with their respective home-node environment for each I/O operation.
Clearly, in such cases these processes would be better off
not migrating. The next section describes a mechanism to
overcome this last problem.
2.3 The MOSIX File System (MFS)
We implemented the MOSIX File System (MFS) [1] which
provides a unified view of all files on all mounted file sys-
tems (of any type) in all the nodes of a MOSIX cluster, as
if they were all within a single file system. For example,
if one mounts MFS on the /mfs mount-point, then the
file /mfs/1456/usr/tmp/myfile refers to the file
/usr/tmp/myfile on node #1456. MFS is scalable
because the number of MOSIX nodes is practically unlim-
ited. In MFS, each node in a MOSIX cluster can simultaneously be a file server and run client processes, and each process can work with any mounted file system of any type. One advantage of the MFS approach is that it raises the client-server interaction to the system-call level, which provides total consistency.
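For illustration, a process can reach a file on a remote node through MFS with ordinary system calls; a minimal sketch, reusing the /mfs mount-point and node #1456 from the example above (the file path is the illustrative one used there):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Under MFS, a file on node #1456 is just a path below the
       /mfs mount-point (node number and path as in the example above). */
    int fd = open("/mfs/1456/usr/tmp/myfile", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf)); /* plain read(2); no special client API */
    if (n >= 0)
        printf("read %zd bytes from node #1456\n", n);
    close(fd);
    return 0;
}
```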
2.4 Direct File System Access (DFSA)
DFSA [1] is a re-routing mechanism that reduces the extra overhead of running I/O-oriented system-calls of a migrated process. This is done by allowing most such system-calls to be performed locally, in the process's current node. In addition to DFSA, MOSIX monitors the I/O operations of each process in order to encourage a process that performs a moderate to high volume of I/O to migrate to the node in which it does most of its I/O. One obvious advantage is that I/O-bound (and mixed I/O and CPU) processes have greater flexibility to migrate from their respective home-nodes for better load-balancing.
2.5 Bringing the process to the file
Unlike most network file systems, which bring the data
from the file server to the client node over the network,
the MOSIX algorithms attempt to migrate the process to
the node in which the file resides. Usually most file opera-
tions are performed by a process on a single partition. The
MOSIX scheme has significant advantages over other net-
worked file systems as it allows a process to migrate and
use any local partition. Clearly this eliminates the commu-
nication overhead between the process and the file server
(except the cost of the process migration itself). We note
that the process migration algorithms monitor and weigh
the amount of I/O operations vs. the size of each process,
in an attempt to optimize the decision whether to migrate
the process or not.
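As a rough illustration of this trade-off, the sketch below compares the cost of remote I/O against the one-time cost of shipping the process image. This is our simplified cost model, not the actual MOSIX algorithm, which relies on run-time statistics and additional stability criteria; all names and the model itself are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative cost model only; the real MOSIX algorithms use
 * run-time statistics and further stability criteria. */

/* Cost of staying put: every byte of I/O crosses the network. */
static double cost_stay(size_t expected_io_bytes, double net_bytes_per_sec)
{
    return (double)expected_io_bytes / net_bytes_per_sec;
}

/* Cost of migrating: ship the process image once, then do local I/O. */
static double cost_migrate(size_t process_image_bytes, double net_bytes_per_sec,
                           size_t expected_io_bytes, double disk_bytes_per_sec)
{
    return (double)process_image_bytes / net_bytes_per_sec
         + (double)expected_io_bytes / disk_bytes_per_sec;
}

static bool should_migrate(size_t io_bytes, size_t image_bytes,
                           double net_bps, double disk_bps)
{
    return cost_migrate(image_bytes, net_bps, io_bytes, disk_bps)
         < cost_stay(io_bytes, net_bps);
}
```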
3 The Design and Implementation of MOPI
This section presents the design and implementation of
MOPI, a library that provides means for parallel access to files. MOPI supports partitioning of large files into independent data segments that are placed in different nodes. Client applications can access such files transparently, without any knowledge of the partitioning scheme or even the fact that the files are partitioned. The MOPI library requests information about a file from one or more dedicated servers, called Meta Managers (MM), which are responsible for managing large files, e.g., allocating new segments, removing segments, etc. The MM uses a general-purpose daemon (MOPID) for such service requests.
Below we describe the partitioning method of files,
then present the MOPI implementation.
3.1 The MOPI file structure
A MOPI file consists of two parts: a Meta Unit and Data Segments, as shown in Figure 1.
Figure 1. The MOPI file structure (a Meta Unit with DS Size = 50MB, plus Data Segments DS 1-DS 12 spread over the storage of Nodes 1-3)
3.1.1 The Meta Unit (MU)
The MU stores the file attributes, including the number and
ID of the nodes to which the file is partitioned; the size
of the partition unit (data segment) and the locations of the
segments. The functions and operations of the MU are similar to the handling of i-nodes in UNIX file-systems; e.g., it can be stored in stable storage when the file is not active (not accessed by any process) and is loaded by the MM when the file becomes active.
3.1.2 Data segments
Data is divided into Data Segments (DS). A DS is the
smallest unit of data for I/O optimization, e.g., when con-
sidering a migration of the process to the data. The seg-
ment size could vary from 1MB to almost 4GB, and all the
segments of the same file must be of the same size. Seg-
ments are created when the file is partitioned among the
nodes. After its creation, each DS exists as an autonomous
unit. When created, each segment is assigned to a node us-
ing a round-robin scheme. We note that other partitioning schemes could be implemented, e.g., random or space-conserving schemes, which balance the use of disk space among the nodes. Also note that, as in regular UNIX files, gaps (holes) might exist in a file; in that case, reading from a hole returns 0's.
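A minimal sketch of the offset-to-segment and round-robin segment-to-node mappings described above; the function and parameter names are ours, as the paper does not show the MOPI internals.

```c
#include <stdint.h>

/* A byte offset falls in segment offset / ds_size, where ds_size is
 * the fixed Data Segment size recorded in the file's Meta Unit. */
static uint64_t offset_to_segment(uint64_t offset, uint64_t ds_size)
{
    return offset / ds_size;
}

/* Round-robin assignment of a segment to one of the nodes listed in
 * the Meta Unit (names are illustrative). */
static int segment_node(uint64_t segment, const int *node_ids, int num_nodes)
{
    return node_ids[segment % (uint64_t)num_nodes];
}
```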
3.2 The MOPI implementation
The prototype MOPI implementation consists of three parts: a user-level library that is linked with the user application; a set of daemons for managing meta-data; and an optional set of utilities for performing basic operations on MOPI files from the shell. See Figure 2 for details.
Figure 2. The MOPI components (each node runs an application linked with the MPI and MOPI libraries, a service daemon and MFS over the native file system and its local storage; the MetaData Manager and Garbage Collector run as cluster-wide services)
3.2.1 The MOPI interface
In order to use MOPI, the application should be modified to use the MOPI functions instead of the regular file system calls. The MOPI interface includes all the common UNIX file access and manipulation functions, such as mopi_open, mopi_close, mopi_write, mopi_read, mopi_readahead, mopi_lseek, mopi_fcntl, mopi_stat, etc.
We note that mopi_readahead is an asynchronous function that returns immediately, without finishing the actual read. It can be used to overlap computation with I/O, as well as to pre-load parts of files into the main memory of nodes and then migrate processes to the respective nodes to benefit from parallel, local access to the data.
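To make the interface concrete, here is a minimal usage sketch. The paper lists only the function names, so the header name and the exact signatures (assumed to mirror their UNIX counterparts, including an assumed (fd, offset, length) form for mopi_readahead) are our assumptions:

```c
#include <fcntl.h>      /* O_RDONLY */
#include <stdio.h>
#include <unistd.h>     /* SEEK_SET, ssize_t */
#include "mopi.h"       /* hypothetical header; not named in the paper */

int main(void)
{
    /* Assumed to mirror open(2); the file path is illustrative. */
    int fd = mopi_open("/mopi/bigfile", O_RDONLY);
    if (fd < 0)
        return 1;

    /* Asynchronous hint: pre-load the first 50MB segment; returns
       immediately without finishing the actual read (signature assumed). */
    mopi_readahead(fd, 0, 50 * 1024 * 1024);

    char buf[1 << 16];
    mopi_lseek(fd, 0, SEEK_SET);                 /* assumed to mirror lseek(2) */
    ssize_t n = mopi_read(fd, buf, sizeof(buf)); /* assumed to mirror read(2)  */
    if (n > 0)
        printf("read %zd bytes\n", n);

    mopi_close(fd);
    return 0;
}
```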
3.2.2 Configuration and data access
The MOPI installation includes allocating disk space in one or more of the nodes (allocating a dedicated partition is recommended); editing configuration files that define which nodes hold the data and the access path to the MOPI files; and starting a number of daemons, e.g., MM and MOPID.
A client process wishing to access a MOPI file must first open the file by sending a request to an MM. The MM processes the open request and loads the MU. The process then requests segment location(s), accesses the segment(s) directly via the MFS file system, and may migrate to the node(s) holding the segment(s) if the MOSIX algorithms decide to do so.
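The sketch below schematizes this flow from the client's side. The Meta Manager query is faked by a stub and the per-segment path layout is invented for illustration; in reality the MOPI library performs the MM exchange internally.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Stub standing in for a Meta Manager query; here it fakes the
   round-robin answer for a 3-node placement (all names are ours). */
static int mm_segment_node(int segment, int num_nodes)
{
    return 1 + segment % num_nodes;   /* node IDs 1..num_nodes */
}

int main(void)
{
    int seg  = 7;
    int node = mm_segment_node(seg, 3);

    /* Access the segment directly through MFS on the node that holds
       it; the per-segment file name below is purely illustrative. */
    char path[256];
    snprintf(path, sizeof(path), "/mfs/%d/mopi/bigfile.seg%d", node, seg);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* ... read and process the segment; if the MOSIX algorithms observe
       enough I/O here, the process may be migrated to this node ... */
    close(fd);
    return 0;
}
```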
3.3 MOPI support for MPI
MOPI supports the MPI-IO standard [13]. An application using the MPI-IO interface can run on top of MOPI without any further modifications. In particular, we extended the ROMIO [14] implementation of the MPI-IO standard, which includes support for NFS, UFS, PVFS, XFS, PFS, SFS, PIOFS and HFS. In the next section we present the performance of a benchmark written in MPI.
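For reference, a minimal MPI-IO fragment of the kind that runs unmodified on top of MOPI through ROMIO. The calls are standard MPI-IO; only the file name and access pattern are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank reads a disjoint 1MB block of the same shared file;
       ROMIO dispatches the I/O to the underlying file system (here
       MOPI) with no application changes. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "bigfile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    enum { BLOCK = 1 << 20 };
    static char buf[BLOCK];
    MPI_Status status;
    MPI_File_read_at(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                     MPI_CHAR, &status);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```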
4 Performance of Parallel I/O with MOPI
This section presents the performance of MOPI using
benchmarks that simulate heavy file system loads, parallel
file system stress and tests to determine the optimal seg-
ment size.
All measurements were performed in a cluster with 60
identical workstations, each with a Pentium III 1133MHz,
512MB RAM, a 20GB 7200 RPM IDE disk and a 100
Mb/Sec Ethernet NIC. The nodes were connected by a 96X96 matrix switch and ran under MOSIX [12] for Linux 2.4.18. All
the presented results reflect the average of 5 measurements.
For reference, we used the hdparm utility of Linux
and measured the average read rate of the disks at
38MB/Sec. We then used the Bonnie [7] benchmark
and measured the throughput of the (ext2) file system at
36.3MB/Sec for read and 41.89 MB/Sec for write. We note
that the write rate was higher than both of the above read
rates due to caching optimization. We also used the ttcp-
1.12 benchmark to measure the speed of TCP/IP between
pairs of nodes at an average rate of 11.17 MB/Sec for 8KB
blocks, with less than 0.5% variations for 4KB, 16KB and
32KB blocks.
4.1 Heavy file system loads
This benchmark demonstrates the performance of DFSA
with MFS [1] vs. NFS and local file access. We executed
the PostMark [11] benchmark, which simulates heavy file
system loads, e.g., as in a large Internet electronic mail
server. First, the benchmark creates a large pool of random
size text files, then it performs a sequence of transactions,
recursively, until a predefined workload is obtained. The
benchmark was performed between a pair of nodes using 3
different file access methods and block sizes, ranging from
1K byte to 64K bytes.
Table 1. File system access times (Sec.)

Block size   Local   MFS with DFSA   NFS
1K           14.4    18.0            162.0
4K           13.2    15.4            162.0
8K           13.0    15.0            161.6
16K          13.0    15.0            162.0
32K          13.8    15.4            162.8
64K          14.2    16.4            163.0
The results are presented in Table 1. The second col-
umn shows the Local times, when both the process and the
files were in the same (server) node; the third column shows
the MFS with DFSA times, i.e., the benchmark started in a
client node then migrated to the server node; and the fourth
column shows the NFS times from a client (in its MOSIX
home) node to a server node.
From the results in Table 1 it follows that on average
(for all block sizes) MFS with DFSA is only 16.6% slower
than Local, and more than 10.3 times (900%) faster than
NFS. We note that this last result motivated the develop-
ment of MOPI.
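As a check, the reported averages are consistent with averaging the per-block-size ratios of Table 1 (our reconstruction of the calculation):

\[
\frac{1}{6}\sum_{b} \frac{t_{\mathrm{MFS}}(b)}{t_{\mathrm{Local}}(b)} \approx 1.166,
\qquad
\frac{1}{6}\sum_{b} \frac{t_{\mathrm{NFS}}(b)}{t_{\mathrm{MFS}}(b)} \approx 10.3,
\]

where b ranges over the six block sizes.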
4.2 Local vs. MOSIX vs. remote read
This test presents the performance of MOPI using the
IOR [15] parallel file system stress benchmark, developed
at LLNL. In this benchmark all the processes access the
same file using the MPI-IO interface. First, each process is
allocated to a different node (using MPI). Then each pro-
cess opens the file and seeks to a different segment. Then
one segment is written to a local disk, followed by a read
of another (remote) segment. This procedure is repeated
several times (a parameter), so that in each iteration each
process accesses a different segment and no two processes
access the same segment concurrently.
To test the performance of applications that scan
(read-once, e.g. filter) large amounts of data that is al-
ready present in different nodes, we ran one iteration of
the IOR benchmark (without the write phase). We mea-
sured the throughput rates (MB/Sec) of the read operation
when each node held one (1GB) data segment and ran one
process. The test was repeated for the following cases:
1. All-local: each process was created in the same node
where its data was located and accessed it locally.
2. Forced-migration all-local: each process was cre-
ated in a different node and was forced to migrate to
its respective data node as soon as it opened the file.
3. MOSIX: each process was created in a different node
and later migrated to its respective data node using the
MOSIX automatic process distribution algorithms.
4. All-remote: each process was created in a different
node and accessed all its data via the network.
The results of these tests (aggregate throughput vs.
number of nodes) are shown in Figure 3. Obviously, the
highest rate, with a weighted average of 35.15 MB/Sec per
node, was obtained for the all-local test. Note that this case represents the throughput of the best possible static assignment, in which each process accesses exactly one (local) segment. Clearly, this performance could not be sustained if such a process accessed segments in different nodes.
Figure 3. Maximal vs. MOSIX vs. remote read rates (aggregate throughput, MB/Sec, vs. number of nodes, 0-60; legend: TCP/IP max bandwidth, All-local, Forced migration all-local, MOSIX, All-remote)
Just below the all-local are the forced-migration all-
local access results, which represent the theoretical maxi-
mal throughput of MOSIX with a single migration of each
process. The obtained weighted average throughput was
32.68 MB/Sec per node, only 7.5% slower than the all-
local throughput. The weighted average throughput of the
MOSIX test was 27.39 MB/Sec, about 19% slower than the
forced-migration all-local and 28% slower than the all-
local throughput. We note that the lower rates of MOSIX are due to its stability algorithms, which prevent migration of short-lived processes. The lowest weighted average throughput, of only 9.18 MB/Sec per node, was obtained by the all-remote test, which represents disk striping over the network. The results obtained were 28% slower than the throughput of TCP/IP (shown for reference as a dotted line) and almost 3 times (198%) slower than the weighted average throughput of MOSIX.
The respective maximal aggregate throughputs with 60 nodes were 2117 MB/Sec for the all-local test, 1945 MB/Sec for the forced-migration all-local, 1633 MB/Sec for MOSIX and 525 MB/Sec for the all-remote tests.
Observe that due to the large capacity of our switch, the results of all the tests show linear speedups. Clearly, this would not be the case for the all-remote test, which may saturate a network built of several smaller switches, unlike the MOSIX approach, which scales up without heavy use of the network.
4.3 Write/read with several migrations
In this test we used the IOR benchmark (both write and
read) to measure the degradation of the I/O performance
of MOSIX due to several process migrations. First, we cre-
ated one process in each node. Then the test was conducted
with 1, 3 and 6 iterations, where in each iteration each pro-
cess wrote one (1GB) segment locally, then read another
segment from a remote node. Note that during the read
phase the MOSIX algorithms were expected to detect and
migrate each process to its respective data node.
Figure 4. I/O rates with several process migrations (aggregate throughput, MB/Sec, vs. number of nodes, 0-60; legend: TCP/IP max bandwidth, MOSIX-1 migration, MOSIX-3 migrations, MOSIX-6 migrations)
Figure 4 shows the results of this test (throughput vs. number of nodes) with 1, 3 and 6 process migrations. The respective weighted average throughputs were 24.68 MB/Sec, 23.79 MB/Sec and 22.46 MB/Sec, with maximal aggregate throughputs of 1477.64 MB/Sec, 1429.41 MB/Sec and 1335.34 MB/Sec, respectively, for 60 nodes. From these results it follows that each process migration resulted in a per-node throughput loss of about 1.8% and an aggregate throughput loss of about 1.6% for 60 nodes.
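The per-migration figure follows from the endpoints above: going from 1 to 6 migrations adds five migrations, so

\[
\frac{24.68 - 22.46}{24.68} \approx 9.0\%
\quad\Longrightarrow\quad
\frac{9.0\%}{5} \approx 1.8\% \text{ per migration.}
\]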
4.4 Choice of the segment size
To help users select the optimal segment size, Figure 5
presents the I/O throughput rates for 16MB – 4GB seg-
ments, using the same set of tests as in Section 4.2 and a 60
node cluster. The forced-migration throughput reached its maximum at a segment size of 128MB, slightly below the aggregate rate of parallel reads from 60 disks (shown as a dotted line for reference). Below that are the MOSIX results, which show a steady increase up to a segment size of 2GB. As expected, the MOSIX results converge to the forced-migration results as the segment size increases. Finally, just under the TCP/IP line (which reflects parallel reads from a client to 60 servers) are the all-remote results, which show maximal performance for 64-128MB segments.
Figure 5. I/O rates, different segment sizes, 60 nodes (aggregate throughput, MB/Sec, vs. segment size, 16MB-4096MB; legend: TCP/IP max bandwidth, Disk max bandwidth, Forced migration all-local, MOSIX, All-remote)
5 Conclusions and Future Work
This paper presented a new paradigm for cluster high I/O
performance that combines data partitioning with dynamic
work distribution. Our scheme supports parallel access to
segments of data by moving the processes to the data, to
benefit from local access, as opposed to the traditional way
of bringing the data to the processes. Our scheme scales
up linearly and does not saturate the network. As the segment size increases, its performance asymptotically approaches that of local disk access.
The work described in this paper could be extended
in several directions. First, it is possible to add support for
segment replication, to allow parallel read of the same seg-
ment by different processes in different nodes. Another option is to support memory-only segments, by which a large file is pre-loaded into the main memory of the cluster's nodes, thus allowing faster access to the data without any disk access. Another extension could be to support GRID computing, whose unpredictable communication patterns justify process migration to the data node. Finally, it would be interesting to extend MOPI to support shared storage facilities, e.g. SAN or NAS, in which case process migration could be used to balance the load over the I/O channels of the nodes.
Acknowledgements
We wish to thank Danny Braniss and Assaf Spanier for
their help. This research was supported in part by the Min-
istry of Defense and by a grant from Dr. and Mrs. Silver-
ston, Cambridge, UK.
References
[1] L. Amar, A. Barak, A. Eizenberg and A. Shiloh.
The MOSIX Scalable Cluster File Systems for Linux.
http://www.MOSIX.org, July 2000.
[2] K. Amiri, D. Petrou, G.R. Ganger and G.A. Gib-
son. Dynamic Function Placement for Data-intensive
Cluster Computing. Proc. USENIX Annual Technical
Conference, San Diego, CA, June 2000.
[3] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, D.S. Roselli and R.Y. Wang. Serverless Network File Systems. ACM TOCS, 14(1), pp. 41-79, 1996.
[4] A. Barak and A. Braverman. Memory Ushering in a
Scalable Computing Cluster. Journal of Microproces-
sors and Microsystems, 22(3-4), Aug. 1998.
[5] A. Barak, O. La’adan and A. Shiloh. Scalable Clus-
ter Computing with MOSIX for LINUX. Proc. 5-th
Annual Linux Expo, pp. 95-100, Atlanta, GA, 1999.
[6] M. Beynon , C. Chang, U. Catalyurek, T. Kurc, A.
Sussman, H. Andrade, R. Ferreira and J. Saltz. Pro-
cessing Large-Scale Multi-dimensional Data in Par-
allel and Distributed Environments. Parallel Comput-
ing, 28(5), pp. 827-859, 2002.
[7] Bonnie File System Benchmark
http://www.textuality.com/bonnie.
[8] P.H. Carns, W.B. Ligon III, R.B. Ross and R. Thakur.
PVFS: A Parallel File System For Linux Clusters.
Proc. 4-th Annual Linux Conference, pp. 317-327,
Atlanta, GA, 2000.
[9] Y. Cho, M. Winslett, M. Subramaniam, Y. Chen, S.
Kuo and K.E. Seamons. Exploiting Local Data in
Parallel Array I/O on a Practical Network of Work-
stations. Proc. 5-th Workshop on I/O in Parallel and
Distributed Systems, pp. 1-13, San Jose, CA, 1997.
[10] Global File System (GFS). http://www.sistina.com.
[11] J. Katcher. PostMark: A New File System Bench-
mark. http://www.netapp.com.
[12] MOSIX. http://www.MOSIX.org.
[13] P. Pacheco. Parallel Programming with MPI. Morgan
Kaufmann Pub. Inc., 1996.
[14] R. Thakur, W. Gropp and E. Lusk. On Implementing
MPI-IO Portably and with High Performance. Proc.
6-th Workshop on I/O in Parallel and Distributed Sys-
tems, pp. 23-32, 1999.
[15] The I/O Stress Benchmark Codes, ior_mpiio (Interleaved or Random) benchmark. SIOP, LLNL. http://www.llnl.gov/asci/purple/benchmarks/limited/ior.