Preprint

Orca: A Software Library For Parallel Computation of Symbolic Expressions via Remote Evaluation on MPI Systems


Ahmet Artu Yıldırım
Independent Researcher
Abstract. This study describes a Scheme library, named Orca, which is used to compute symbolic expressions in parallel via remote evaluation based on the message-passing interface (MPI) standard. Today, MPI is one of the most widely used standards, in particular in high-performance computing systems. However, MPI programmers are explicitly required to deal with many complexities that render MPI programming hard to reason about. We designed and implemented a set of new APIs to alleviate this complexity by taking advantage of the expressive power of the Scheme language using remote evaluation techniques on MPI systems. We introduce the application programming interface (API) of the library and evaluate the implemented model on a real-world application of a common parallel algorithm. Our experiments show that it is practical and useful for a variety of applications to exploit multiple processors of a distributed-memory architecture.
Keywords: parallel computation, symbolic expression, remote evaluation, MPI, message-passing interface, Scheme
1 Introduction
High-performance computing (HPC) has attracted growing attention from research institutes and industry for decades. The computing power of HPC systems enables us to collectively solve a diverse range of complex problems by leveraging many interconnected processors communicating over fast networks.
MPI has been a well-known and widely used standard in the high-performance computing community for developing parallel algorithms on computer clusters as well as on single machines [8]. MPI provides an abstraction for inter-process communication by passing messages between processes through cooperative operations across different machine and network architectures, thus promoting portability and ease of use [15].
However, parallel programming using MPI is not without complications: programmers must deal with many complexities, including multiple code branches, the difficulty of designing deadlock-free programs, buffer management, and the side-effecting nature of the APIs.
As a special case of the message-passing model, the remote procedure call (RPC) mechanism provides synchronized, type-safe communication between two processes, typically called the client and the server [22], giving programmers the "illusion" that the procedure is invoked locally, which makes it easy to reason about. Furthermore, remote evaluation (REV) generalizes remote procedure calls by providing the ability to evaluate a program expression on a remote computer [20]. In this technique, a program statement is sent from one computer, the client, to another computer, the server. When the server receives a REV request, it extracts the procedure and arguments from the message, evaluates the procedure with the arguments, and returns the result message to the client.
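The REV round trip just described can be illustrated inside a single Guile process. This is a minimal local sketch under stated assumptions, not Orca's implementation: the "network message" is simply a string, and serialize, deserialize, rev-server-handle, and rev-client-request are hypothetical helper names.

```scheme
;; Single-process sketch of remote evaluation (REV).
;; The "channel" between client and server is just a string;
;; no network or MPI is involved. All helper names are hypothetical.

(define (serialize datum)
  ;; Render a datum in its external (textual) representation.
  (call-with-output-string (lambda (port) (write datum port))))

(define (deserialize text)
  ;; Parse the textual representation back into a datum.
  (call-with-input-string text read))

(define (rev-server-handle request-text)
  ;; Server side: extract the expression, evaluate it, reply with text.
  (let* ((expr (deserialize request-text))
         (value (eval expr (interaction-environment))))
    (serialize value)))

(define (rev-client-request expr)
  ;; Client side: ship the expression as text, then decode the reply.
  (deserialize (rev-server-handle (serialize expr))))

(display (rev-client-request '(map (lambda (x) (* x 3)) '(1 2 3))))
(newline)   ; prints (3 6 9)
```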
In this work, we present a programming model for the parallel computation of symbolic expressions and implement it as a Scheme [21] library, named Orca. The library allows programmers to evaluate given symbolic expressions in parallel on multiple processes. In this model, the procedure and arguments of an expression are distributed to processes using different data partitioning methods, either broadcasting or scattering, depending on the API used. We make use of functional programming techniques through remote evaluation abstractions, on top of the synchronous and asynchronous communication capabilities of MPI, to alleviate the complexity of parallel programming on MPI systems.
The rest of the paper is organized as follows. Section 2 presents the background and motivation of this study. In Section 3, we describe the computational model of the implemented library and the functionality of its APIs. Section 4 presents a case-study program developed using the library and a performance evaluation of the turnaround time of the broadcasting operation for varying numbers of processors and data sizes. Finally, we conclude in Section 5 by discussing future research directions.
2 Background and Motivation
2.1 What is wrong with MPI?
From a programming perspective, common problems encountered in MPI programs include the overhead of decomposing, developing, and debugging parallel programs [19]. We argue that this complexity arises largely for the following reasons:
Multiple code branches exist in many MPI programs, using if-else constructs to control the flow of the algorithm per process group. One common model is the master/worker pattern [16], which is typically implemented with two explicit groups of conditional branches: one set of conditional blocks for the master process, which orchestrates the computation and distributes work, and another set for the worker processes, which perform the common task. Such segregated execution flows add complexity to MPI programs, making parallel programs hard to write and maintain. This complexity is exacerbated when the flow of the program depends on random data [4].
Buffer management is a fundamental aspect of MPI programming in C and Fortran as part of the standard [15]: programmers are expected to allocate memory buffers of the specified data type and then pass pointers to these buffers as arguments to the MPI collective and point-to-point primitives. Such memory management is prone to memory corruption and memory leaks when not handled properly. Wrapper libraries that we found useful provide higher-level APIs on top of MPI that abstract away explicit memory management and allow the use of generic data types [3,17,7].
Side effects are inevitable in MPI due to its buffer-based communication. For example, the basic function MPI_Recv accepts a memory buffer as an argument and fills that memory with the data received from the given process. This style of programming is highly susceptible to software bugs that arise when the process unintentionally modifies the content of the returned buffer. The functional programming paradigm addresses this problem via pure functions, which always produce the same result for the same arguments, independent of the program's execution environment, and cause no side effects; this considerably simplifies how parallel programs are constructed [9,12,11].
Deadlocks might occur when a process or process group attempts to receive data with no matching send operation, causing the receiving counterpart to hang [18]. Finding deadlocks is notoriously difficult in MPI because of the low-level nature of the MPI APIs and the high degree of interaction between processes during program execution [13,10,4].
2.2 Related Works
MPI for Python provides bindings of the Message Passing Interface (MPI) standard for the Python programming language, along with an object-oriented and relatively easy-to-use interface on top of the MPI-1/2/3 specifications [3]. In addition to the common MPI functionality, the library allows any Python object to be communicated between processes.
MPI Ruby is another library providing bindings of the MPI standard, exploiting Ruby's dynamic nature [17]. Since everything is an object in Ruby, messages are also interpreted as objects and converted to a byte stream, which gives more flexibility for message transmission and makes programming easier than with MPI's C library. It is argued that such libraries not only simplify the task of writing parallel programs but also make ideal tools for rapid prototyping, which is deemed essential in the scientific computing community.
Avalon extends the client/server model and provides Common Lisp constructs for evaluating expressions remotely on any Common Lisp type, with marshalling and unmarshalling capabilities [2]. Avalon supports transactional remote evaluation in which, when a crash occurs, the system recovers to the previously saved consistent state using atomic commits.
2.3 Symbolic Expression
A symbolic expression, in the context of the Lisp (LISt Processor) programming language, is a program statement composed of either a single atom (e.g., a symbol or a number) or an ordered list of atoms and nested expressions, represented as a binary tree, that evaluates to a value [14]. What differentiates Lisp from most other programming languages is that Lisp treats "code as data" in the form of symbolic expressions, which gives programmers enormous flexibility through its advanced macro system. A simple example is as follows.
(define input (list 1 2 3))
(map (lambda (x) (* x 3)) input)
=> (3 6 9)
In this example, the map function evaluates to the list of numbers (3 6 9) as the result of applying (* x 3) to each item in the input list (1 2 3). The function map is pure in the sense that it does not modify any of its arguments; therefore, no side effects are introduced at runtime, and applying the same function to the same arguments always produces the same result without any change in the program state. Such a methodology renders programs more concise and less prone to software bugs, with the added benefits of improved testability, reliability, and maintainability [1].
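Because Orca ships code to worker processes as data, it may help to see that a quoted expression is an ordinary list. The following sketch is plain Guile, independent of Orca, and the names expr and expr2 are illustrative: it inspects the example expression above, builds a modified copy, and evaluates both.

```scheme
;; Code as data: a quoted expression is an ordinary list that can be
;; inspected, rewritten, and finally evaluated.
(define input (list 1 2 3))

(define expr '(map (lambda (x) (* x 3)) input))

(display (car expr))      ; prints map: the operator is just a symbol
(newline)
(display (length expr))   ; prints 3: the operator plus two operands
(newline)

;; Build a new expression that reuses the operator and the lambda
;; but substitutes a different, quoted operand list.
(define expr2 (list (car expr) (cadr expr) ''(10 20)))

(display (eval expr (interaction-environment)))    ; prints (3 6 9)
(newline)
(display (eval expr2 (interaction-environment)))   ; prints (30 60)
(newline)
```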
3 Programming Model
The programming model of the Orca library is based on the MPI standard, which defines the foundation for communication among a set of processes with separate memory spaces. In this model, many processes are typically launched across the nodes by a job scheduler (e.g., SLURM [23]) on a computer cluster, or even locally, in a single program, multiple data (SPMD) fashion, where the computation differs based on the rank number uniquely identifying each process.
Furthermore, our model categorizes processes into two types: the master process and worker processes. A worker process waits for an incoming symbolic expression from the master process, executes it in its runtime environment, and then sends the result of the computation back to the master process. The master process, in turn, runs a kernel program defining the main computation using the implemented APIs, which control the distribution of symbolic expressions among the worker processes and collect the computation results from them into a list ordered by rank number. The Orca library allows all worker processes, with their separate memory spaces, to collectively apply the specified function to the given arguments, which are either broadcast or partitioned among the processes in a way similar to the MPI_Bcast and MPI_Scatter functions of the MPI standard, respectively.
In Orca, we extend the remote evaluation paradigm by having multiple servers instead of one, with a single client acting as the master process that drives the execution of the computation using MPI's asynchronous communication routines. The kernel code is the main Scheme code of an application, in which programmers implement a parallel program to run on a computer cluster or on a single machine, ideally with multiple processor cores, by explicitly specifying the data partitioning strategy for the function and arguments to be evaluated on worker processes in parallel.
3.1 Module Structure
We implement the library using Guile, an implementation of the Scheme programming language. Guile conforms to the R5RS, R6RS, and R7RS Scheme standards and extends Scheme with a variety of features, including a module system, dynamic linking, and a foreign function interface (FFI) [6].
Fig. 1: Communication layer of the Orca library. (Diagram: the master process and worker processes 1 to n, each running on top of the MPI library, communicate through MPI.)
We benefit from Guile's module system, which provides a "namespace" for a set of symbols bound to Scheme objects and logically organizes those symbols. The Orca library consists of the modules (orca) and (orca internal). Binding functions that act as an interface to the MPI C library procedures, along with internal functions, reside in the (orca internal) module. On top of the internal module, the (orca) module implements all the public APIs that applications import. The communication layers of the Orca library are depicted in Figure 1.
Listing 1.1: rpc-worker-process-id
Listing 1.2: rpc-worker-process-size
3.2 Overview of APIs
In this section, we briefly explain the implemented APIs of the Orca library. We categorize the APIs into three classes: utility APIs such as rpc-worker-process-id and rpc-worker-process-size; bootstrapping APIs such as rpc-start; and compute APIs, which perform the given computation.
As a utility API, rpc-worker-process-id lets a worker process determine its id, which uniquely identifies every worker process; numbering begins at 0 and goes up to size - 1. The total number of worker processes (size) is determined via the API rpc-worker-process-size.
To bootstrap the execution environment, rpc-start must be called by all processes; however, the behaviour of the function differs based on the role of the process. When a worker process reaches rpc-start, it enters a loop, and the flow of execution never goes beyond the rpc-start call in the kernel code. On the other hand, execution continues when rpc-start is called in the context of the master process. In the loop, rpc-start instructs the worker process to listen on the message channel for an RPC message containing a symbolic expression from the master process. Once the expression has been executed in its address space, the worker returns the result to the master process and then waits for the next incoming message from the master process.
The function rpc-finalize is called by the master process to terminate the execution environment. When the function is invoked, a finalize message is sent to all worker processes, instructing them to terminate the loop; the worker processes then call MPI_Finalize, followed by the exit command, which terminates the process with success. Note that calling any Orca function after rpc-finalize results in an error, as the execution environment has already been terminated.
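The worker-side control flow of rpc-start and rpc-finalize can be sketched as a message loop. This is a hypothetical single-process simulation: real MPI receives are replaced by a list of (tag . payload) pairs, the tags eval and finalize are assumptions rather than Orca's actual message format, and results are collected into a list instead of being sent to the master process.

```scheme
;; Simulated worker loop of the kind rpc-start runs on worker processes.
;; Incoming MPI messages are replaced by a list of (tag . payload) pairs;
;; replies are accumulated in a list instead of being sent to the master.
;; The tags 'eval and 'finalize are hypothetical, not Orca's wire format.

(define (worker-loop inbox)
  (let loop ((inbox inbox) (replies '()))
    (if (null? inbox)
        (error "inbox exhausted before a finalize message")
        (let ((tag (car (car inbox)))
              (payload (cdr (car inbox))))
          (case tag
            ((eval)
             ;; Evaluate the expression and "send" the result back.
             (loop (cdr inbox)
                   (cons (eval payload (interaction-environment)) replies)))
            ((finalize)
             ;; Leave the loop; a real worker would now call MPI_Finalize
             ;; and exit instead of returning.
             (reverse replies))
            (else (error "unknown message tag" tag)))))))

(display (worker-loop (list (cons 'eval '(+ 1 2))
                            (cons 'eval '(* 6 7))
                            (cons 'finalize #f))))
(newline)   ; prints (3 42)
```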
The function rpc-make is the most basic compute API for executing symbolic expressions collectively. It evaluates the given symbolic expression, in "quoted" form, in the runtime environment of the worker processes and then returns a list of values in which the i-th element is the result of the computation on the i-th worker.

Listing 1.3: rpc-start
Listing 1.4: rpc-finalize
(rpc-make <symbolic expression>) → list
Listing 1.5: rpc-make
Fig. 2: Collective execution of the symbolic expression via rpc-make. (Diagram: inside a [while not_finalized] loop, the master process sends (rpc-make exp) to worker processes 1 to n and receives (result1, ..., resultn).)
Figure 2 shows the interaction between processes for the computation of a given symbolic expression from the start to the end of execution using the rpc-make API. This design generalizes to all of Orca's compute APIs in the sense that, once rpc-start has been invoked, worker processes execute the given expression, and the master process then coalesces the results received from the worker processes into a single list, which it returns.
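The contract of rpc-make (one expression, evaluated on every worker, results ordered by worker id) can be simulated locally. In this sketch, simulate-rpc-make and the parameter current-worker-id are hypothetical stand-ins: the parameter plays the role that rpc-worker-process-id plays on a real worker, and the "workers" are evaluated one after another in a single process.

```scheme
;; Hypothetical stand-in for the per-process value that
;; rpc-worker-process-id reports on a real worker.
(define current-worker-id (make-parameter #f))

(define (simulate-rpc-make expr worker-count)
  ;; Worker i evaluates the same quoted expression with its own id in
  ;; scope; mapping over (iota n) keeps results ordered by worker id.
  (map (lambda (id)
         (parameterize ((current-worker-id id))
           (eval expr (interaction-environment))))
       (iota worker-count)))

(display (simulate-rpc-make '(* 10 (current-worker-id)) 3))
(newline)   ; prints (0 10 20)
```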
All compute APIs except rpc-make expect a varying number of parameters, which in turn are passed to the specified remote function as arguments. We employ pull-based communication for the distribution of function parameters. In this model, the master process distributes function parameters to all worker processes either by broadcasting or by scattering them, which defines the two types of compute APIs in the Orca library: broadcast APIs and scatter APIs, respectively. On the other hand, worker processes can only send data to the master process by returning a value from the function invoked remotely.
(rpc-apply-bcast <function symbol> <parameter 1> ...) → list
Listing 1.6: rpc-apply-bcast
(rpc-apply-scatter <function symbol> <parameter 1> ...) → list
Listing 1.7: rpc-apply-scatter
The API rpc-apply-bcast sends the serialized function symbol, along with the values of all parameters, from the master process to all worker processes, where each parameter can be any type of literal value, including a list or a single value. The function symbol needs to be accessible in the execution environments of all worker processes.
Fig. 3: The master process applies the given symbolic expression with respect to varying parameter distribution models, shown for three worker processes. (a) rpc-make: (rpc-make '(func 1 2 3)) has worker processes 1, 2, and 3 each evaluate (func 1 2 3). (b) rpc-apply-bcast: (rpc-apply-bcast func 10 '(1 2 3)) has each worker process evaluate (func 10 '(1 2 3)). (c) rpc-apply-scatter: (rpc-apply-scatter func '(1 2 3) '(4 5 6)) has worker processes 1, 2, and 3 evaluate (func 1 4), (func 2 5), and (func 3 6), respectively. (d) rpc-scatter: (rpc-scatter (list '(func1 10) '(func2 20) '(func3 30))) has worker processes 1, 2, and 3 evaluate (func1 10), (func2 20), and (func3 30), respectively.
The scatter API rpc-apply-scatter also expects a varying number of parameters; however, every parameter must be a list whose length equals the total number of worker processes. In this case, the scatter API sends the i-th element of each parameter list to the i-th worker process: the master process distributes the elements of each list in order, determined by the worker process id, and the worker processes then apply the given function to their parameters. Note that because function parameters are stored in lists, they can be constructed flexibly at runtime for a varying number of worker processes, which can be determined via the API rpc-worker-process-size.
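The difference between the broadcast and scatter distribution models, as in panels (b) and (c) of Figure 3, can be sketched with ordinary list operations. The names simulate-apply-bcast and simulate-apply-scatter are hypothetical; unlike the Orca APIs they take procedures rather than function symbols and run in one process, so they only mirror which arguments each worker would receive.

```scheme
;; Local simulation of the two parameter-distribution models.
;; Both helpers are hypothetical; "workers" are positions in the result.

(define (simulate-apply-bcast worker-count func . args)
  ;; Broadcast: every worker receives all arguments unchanged.
  (map (lambda (id) (apply func args)) (iota worker-count)))

(define (simulate-apply-scatter func first-list . more-lists)
  ;; Scatter: every argument is a list with one element per worker;
  ;; worker i receives the i-th element of each list.
  (let* ((arg-lists (cons first-list more-lists))
         (worker-count (length first-list)))
    (for-each (lambda (lst)
                (unless (= (length lst) worker-count)
                  (error "each argument list needs one element per worker")))
              arg-lists)
    ;; map walks the lists in lockstep, pairing up the i-th elements.
    (apply map func arg-lists)))

(display (simulate-apply-bcast 3 (lambda (a lst) (* a (length lst))) 10 '(1 2 3)))
(newline)   ; prints (30 30 30)
(display (simulate-apply-scatter + '(1 2 3) '(4 5 6)))
(newline)   ; prints (5 7 9)
```

In the scatter case the worker count is implied by the list lengths, which is why constructing those lists at runtime from rpc-worker-process-size, as suggested above, keeps a kernel independent of the number of processes.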
Another scatter API is rpc-scatter, which is designed to perform task parallelism instead of the data parallelism applied by the other compute APIs. Task parallelism is a technique employed to enhance locality by executing different parts of a program on the available processes [5]. The rpc-scatter API accepts a list of symbolic expressions whose size equals the number of worker processes, where the i-th expression is sent to the i-th worker process to be evaluated. As depicted in Figure 3, rpc-scatter allows disjoint functions to be applied to the given parameters on worker processes. In this way, the master process retrieves the results of the disjoint applications, executed concurrently, in a single call.

(rpc-scatter <symbolic expression 1> ...) → list
Listing 1.8: rpc-scatter
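Task parallelism with rpc-scatter can be simulated in the same way. In this sketch, simulate-rpc-scatter is a hypothetical name that follows the variadic form of Listing 1.8; the expressions are evaluated one after another here, whereas Orca would evaluate them concurrently on separate worker processes.

```scheme
;; Local simulation of rpc-scatter's task-parallel contract: the i-th
;; quoted expression is evaluated by "worker" i, so each worker may run
;; a different function. Evaluation here is sequential, not concurrent.

(define (simulate-rpc-scatter . exprs)
  (map (lambda (expr) (eval expr (interaction-environment))) exprs))

(display (simulate-rpc-scatter '(+ 1 2)
                               '(string-upcase "orca")
                               '(length '(a b c d))))
(newline)   ; prints (3 ORCA 4)
```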
3.3 Marshalling
The Orca library transforms parameter expressions into UTF-8 text for transmission using the Scheme function call-with-output-string. We verified on the Guile runtime that the Orca library supports implicit marshalling of generic data, such as numeric values and strings, and of compound data, including lists, vectors, arrays, byte vectors, and association lists, which may contain generic data as well as compound data in a nested fashion. However, we note that objects of GOOPS, the object-oriented system of Guile, must be explicitly serialized to a supported symbolic expression (e.g., a list) before being passed to Orca APIs, and deserialized back into GOOPS objects on worker processes in the given remote function, or vice versa.
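The textual marshalling described above can be approximated with standard Guile ports: write a datum into a string with call-with-output-string, then read it back. This is a sketch, not Orca's internal code; marshal, unmarshal, and payload are hypothetical names, and the payload mixes several of the data types listed above.

```scheme
(use-modules (rnrs bytevectors))   ; for u8-list->bytevector

;; Sketch of textual marshalling: write a datum to a string, then read
;; it back. marshal and unmarshal are hypothetical helper names.

(define (marshal datum)
  (call-with-output-string (lambda (port) (write datum port))))

(define (unmarshal text)
  (call-with-input-string text read))

(define payload
  (list 42 3.5 "text"
        (vector 1 2 (list 3 4))            ; vector containing a list
        (u8-list->bytevector '(1 2 255))   ; byte vector
        '((alpha . 1) (beta . 2))))        ; association list

(define wire (marshal payload))   ; the textual form sent over the channel
(display wire)
(newline)

;; The round trip preserves both structure and contents.
(display (equal? payload (unmarshal wire)))
(newline)   ; prints #t
```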
Symbolic expressions are also required to consist of non-pointer values in order to be serialized properly. This is a fundamental design requirement; otherwise, the Orca library raises an error and terminates all processes when a pointer value is provided. Furthermore, symbolic expressions may refer to any user symbols defined before the rpc-start call, as these become global variables across all processes.
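One way to see why pointer-like values cannot be marshalled is that their written form, such as #<procedure ...>, cannot be read back. The hypothetical predicate marshallable? below tests a value by attempting a write/read round trip; it illustrates the requirement and is not the check Orca performs internally.

```scheme
;; Sketch: a value can travel as text only if its written form can be
;; read back into an equal value. Procedures are written as #<...>,
;; which the reader rejects. marshallable? is a hypothetical helper.

(define (marshallable? value)
  (catch #t
    (lambda ()
      (let ((text (call-with-output-string
                    (lambda (port) (write value port)))))
        (equal? value (call-with-input-string text read))))
    ;; Any error raised while reading means no readable representation.
    (lambda (key . args) #f)))

(display (marshallable? '(1 "two" #(3))))   ; prints #t
(newline)
(display (marshallable? car))               ; prints #f
(newline)
```

In a kernel, the usual way around this restriction is the one described above: define the procedure before rpc-start and ship only its symbol.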
4 Experiments and Discussion
4.1 A Case Study: Distributed Reduction Operation
As a case study, we implemented a distributed reduction operation using the Orca library. The purpose of this case study is to apply the proposed programming model to a common parallel programming problem and assess its effectiveness in developing a parallel program that avoids multiple code branches and side effects by leveraging functional programming techniques.
As shown in Figure 4, in a distributed reduction problem we typically partition the input data into disjoint chunks. Each chunk is then processed by a worker process, which applies the same reduction function proc to every element of its input data to reduce them to a single local value. Worker processes perform this reduction in parallel on different input data. Once the local results have been computed, each worker process sends its result to the master process, which then performs a global reduction on data whose size is the number of worker processes. Finally, the master process computes the global value using the same reduction function proc.

Fig. 4: Worker processes apply the function proc in parallel on disjoint data in Step 1, and finally the master process reduces the local results to compute the global result using the same function in Step 2.
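The two steps of Figure 4 can be sketched in a single process with SRFI-1's reduce, using the same proc and init-value as Listing 1.9. The helpers partition-into and two-step-reduce are hypothetical; in the Orca kernel, Step 1 runs on worker processes through rpc-apply-scatter, and the chunks are sets of file paths rather than numbers.

```scheme
(use-modules (srfi srfi-1))   ; reduce, take, drop

(define proc max)
(define init-value -1)

(define (partition-into lst k)
  ;; Split lst into k contiguous chunks whose sizes differ by at most
  ;; one; trailing chunks may be empty when lst has fewer than k items.
  (let loop ((rest lst) (k k) (chunks '()))
    (if (= k 0)
        (reverse chunks)
        (let ((size (quotient (+ (length rest) k -1) k)))
          (loop (drop rest size) (- k 1) (cons (take rest size) chunks))))))

(define (two-step-reduce data worker-count)
  (let* ((chunks (partition-into data worker-count))
         ;; Step 1: each "worker" reduces its own chunk to a local value.
         (local-results (map (lambda (chunk) (reduce proc init-value chunk))
                             chunks)))
    ;; Step 2: the master reduces the local results with the same proc.
    (reduce proc init-value local-results)))

(display (two-step-reduce '(4 17 8 23 15 42 16 7 3 9) 3))
(newline)   ; prints 42
```

As with reduce in Listing 1.9, init-value is only returned for an empty chunk, so -1 is a safe neutral value for max only when the data contains no numbers below -1.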
 1 #!/usr/bin/env -S guile -s
 2 !#
 3
 4 (use-modules (orca))
 5 (use-modules (srfi srfi-1))
 6 (use-modules (ice-9 regex))
 7 (use-modules (ice-9 ftw))
 8 (use-modules (ice-9 textual-ports))
 9
10 (define proc max)
11 (define init-value -1)
12
13 (define (read-db-numbers db-lst)
14   ...)
15
16 (define (apply-distributed-proc db-lst)
17   (reduce proc init-value (read-db-numbers db-lst)))
18
19 (rpc-start)
20
21 (define (partition-db-lists)
22   ...)
23
24 (format #t "Result of the distributed reduction operation is ~d~%"
25         (reduce proc init-value
26                 (rpc-apply-scatter apply-distributed-proc (partition-db-lists))))
27
28 (rpc-finalize)
Listing 1.9: Kernel of the distributed reduction operation using the Orca library. The implementation details of read-db-numbers and partition-db-lists are omitted; three dots are used as placeholders.
The kernel code of the distributed reduction algorithm is given in Listing 1.9. The first line instructs the operating system to execute the kernel file via the Guile interpreter. Lines 4-8 import the symbols of the Guile modules used in the kernel, including the orca module, which is required for all Orca programs. We define proc to be the max function, which returns the maximum of the given numbers, and init-value to be -1, which is provided to the reduce function as a default value. The symbols proc, init-value, read-db-numbers, and apply-distributed-proc are all defined before the rpc-start call; thus, they are visible in the runtime environments of all processes.
We omit the implementation details of the user-defined functions partition-db-lists and read-db-numbers for clarity. When the worker processes reach the rpc-start call, they block and wait to receive symbolic expressions to compute from the master process. The master process, however, continues the kernel program and calls the reduce function, whose parameters are proc, init-value, and the result of rpc-apply-scatter, respectively. In turn, the Guile interpreter calls the rpc-apply-scatter function, which calls partition-db-lists, a user-defined function returning a list of lists in which the list at the i-th position holds a separate set of input file paths on disk for the i-th worker process. Note that all processes share a disk storing the input files, and each input file contains a collection of numbers. In line 26, rpc-apply-scatter distributes the paths of the input files to the worker processes and instructs the Orca library to apply the apply-distributed-proc function to these input files on each worker process in parallel. As an implementation detail, we utilize Scheme's macro feature to transform the parameters of rpc-apply-scatter into a corresponding symbolic expression to be evaluated by the worker processes. When the worker processes receive the function and parameters, deserialized from the textual form of the symbolic expression, they call the apply-distributed-proc function, whose body is in line 17. Note that all the numbers in the input files are loaded into memory via the read-db-numbers function, and the reduction operation is then performed on these values. When the apply-distributed-proc function returns, the Orca library collects the local results of all worker processes and returns them to the reduce function in line 25 on the master process, which in turn computes the global result and prints it. Finally, the master process calls rpc-finalize to terminate the execution environment.
4.2 Performance Evaluation
We conducted experiments to evaluate the turnaround time of the broadcasting operation using the API rpc-apply-bcast for varying list sizes and numbers of worker processes, and compared this time with that of a corresponding MPI program written in C. Our experiments were performed on a single machine using 32 operating system processes. The machine is equipped with a 2.6 GHz 6-core Intel Core i7 processor and 16 GB of DDR4 memory. We used Guile 3.0.4 to compile the Orca library, and the MPICH library, version 12.1.8, as the implementation of the Message Passing Interface (MPI) standard. We compiled the MPI program using GCC version 9.3.0.
Fig. 5: Turnaround time in milliseconds with respect to the number of worker processes and the number of integers in the dataset: (a) MPI in C, (b) Orca.
For these experiments, we implemented two programs: an Orca program that uses the rpc-apply-bcast API to send an integer list and collect the results from the worker processes, and a C program that uses MPI functions to broadcast an integer list. We conducted experiments using 7 lists of varying lengths, each consisting of 32-bit integers. We measured the turnaround time of the broadcasting operation for both programs, defined as the wall-clock time between the call of the broadcast operation and its completion. rpc-apply-bcast transforms the arguments into the corresponding datum representation, which is then marshalled into textual form for transmission via MPI. Therefore, the turnaround time of the Orca program includes datum transformation time, marshalling time, evaluation time, communication time, and the overhead caused by Guile's runtime environment, whereas the turnaround time of the MPI program contains only the communication time.
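To get a feeling for the marshalling share of this overhead in isolation, one could time a write/read round trip of an integer list locally, with no MPI involved. The sketch below assumes a list of 2^16 integers purely for illustration; elapsed-ms and round-trip are hypothetical helpers, and the time it prints depends on the machine and is not comparable to the reported measurements.

```scheme
;; Sketch: time the textual marshalling round trip of an integer list.
;; No MPI communication is involved; helper names are hypothetical.

(define (elapsed-ms thunk)
  ;; Run thunk; return its result and the wall-clock time in ms.
  (let* ((start (get-internal-real-time))
         (result (thunk))
         (end (get-internal-real-time)))
    (values result
            (/ (* 1000.0 (- end start)) internal-time-units-per-second))))

(define (round-trip datum)
  ;; Marshal to text and unmarshal again, standing in for the
  ;; serialization work done on both ends of a broadcast.
  (call-with-input-string
   (call-with-output-string (lambda (port) (write datum port)))
   read))

(define data (iota (expt 2 16)))   ; illustrative size: 2^16 integers

(call-with-values
    (lambda () (elapsed-ms (lambda () (round-trip data))))
  (lambda (decoded ms)
    (format #t "round trip intact: ~a, elapsed: ~a ms~%"
            (equal? decoded data) ms)))
```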
The turnaround times are plotted in Figure 5. For the Orca program, we observe that the turnaround time grows roughly linearly with both the size of the list and the number of worker processes. The maximum turnaround time for broadcasting a list of 2^16 integers is 538 milliseconds using 32 processes. In comparison, the maximum turnaround time of the MPI program is 121 milliseconds. The Orca program is thus roughly 4.4 times slower than the MPI program in the worst case; we attribute this overhead to the runtime and data marshalling, and consider it acceptable relative to the MPI program.
5 Conclusions and Future Work
In this paper, we have presented Orca, a software library implementing a set of RPC APIs for computing symbolic expressions in parallel, using the MPI library to transmit expressions between processes. We take advantage of the expressive power of the Scheme language to represent code as data in the form of symbolic expressions. Scheme promotes a functional programming paradigm that allows us to define "higher-order" API functions that take a function as an argument and are pure in the sense that no side effects are allowed. These features of the programming model help us alleviate the inherent problems of programming on MPI systems, including code complexity due to multiple code branches, buffer management, and inevitable side effects, by building a new set of APIs based on remote evaluation (REV).
As future research, we will investigate extending the Orca library with a new set of APIs to stream expression data between worker processes and the master process. In that way, we aim to define parallel computations on big data, as well as on "infinite" data, processing each chunk of expressions via the given function until a given condition is met.
Scripting languages are useful for rapidly prototyping and integrating different software systems, but they are not without performance penalties. This is also true for Guile, as an implementation of the Scheme language, whose eval function we use extensively to execute the given expressions at runtime. In addition to the performance cost that comes with the language, there is also a communication cost, because the Orca library transmits symbolic expressions in textual form. We will investigate how to alleviate these problems in order to achieve higher performance on MPI systems.
The source code of the Orca library is made available at https://www. under LGPL version 3 or later.
References
1. Brown, J.R., Nelson, E.: Functional programming. Tech. rep., TRW DEFENSE
2. Clamen, S.M., Leibengood, L.D., Nettles, S.M., Wing, J.M.: Reliable distributed computing with Avalon/Common Lisp. In: Proceedings. 1990 International Conference on Computer Languages, pp. 169–179 (1990)
3. Dalcín, L., Paz, R., Storti, M.: MPI for Python. Journal of Parallel and Distributed Computing 65(9), 1108–1115 (2005)
4. Forejt, V., Joshi, S., Kroening, D., Narayanaswamy, G., Sharma, S.: Precise predictive analysis for discovering communication deadlocks in MPI programs. ACM Trans. Program. Lang. Syst. 39(4) (2017). DOI 10.1145/3095075. URL
5. Foster, I.: Task parallelism and high-performance languages. IEEE Parallel & Distributed Technology: Systems & Applications 2(3), 27– (1994)
6. Foundation, F.S.: (2020). URL
docs-2.2/guile-ref/index.html. [Online; accessed 21-June-2020]
7. Gregor, D., Troyer, M.: (2020). URL
doc/html/mpi.html. [Online; accessed 21-June-2020]
8. Gropp, W., Lusk, E., Skjellum, A.: Using MPI: Portable Parallel Programming
with the Message-Passing Interface. The MIT Press (2014)
9. Hammond, K.: Why parallel functional programming matters: Panel statement. In:
International Conference on Reliable Software Technologies, pp. 201–205. Springer
10. Haque, W.: Concurrent deadlock detection in parallel programs. Interna-
tional Journal of Computers and Applications 28(1), 19–25 (2006). DOI 10.
1080/1206212X.2006.11441784. URL
11. Hughes, J.: Why functional programming matters. The computer journal 32(2),
98–107 (1989)
12. Jones, M.P., Hudak, P.: Implicit and explicit parallel programming in haskell.
Dispon´ıvel por FTP em nebula. systemsz. cs. yale. edu/pub/yale-fp/reports/RR-
982. ps. Z (julho de 1999) (1993)
13. Luecke, G.R., Zou, Y., Coyle, J., Hoekstra, J., Kraeva, M.: Deadlock detection in
mpi programs. Concurrency and Computation: Practice and Experience 14(11),
911–932 (2002)
14. McCarthy, J., Levin, M.I.: LISP 1.5 programmer’s manual. MIT press (1965)
15. Message Passing Interface Forum: Mpi: A message-passing interface standard, ver-
sion 3.1. Specification (2015). URL https://www.mpi-
16. Murthy, V.K., Krishnamurthy, E.V.: Software pattern design for cluster comput-
ing. In: Proceedings. International Conference on Parallel Processing Workshop,
pp. 360–367 (2002)
17. Ong, E.: Mpi ruby: scripting in a parallel environment. Computing in Science
Engineering 4(4), 78–82 (2002)
18. Pacheco, P.S.: Parallel Programming with MPI. Morgan Kaufmann Publishers
Inc., San Francisco, CA, USA (1996)
19. Smith, L.: Mixed mode mpi/openmp programming. UK High-End Computing
Technology Report pp. 1–25 (2000)
20. Stamos, J.W., Gifford, D.K.: Remote evaluation. ACM Trans. Program. Lang.
Syst. 12(4), 537–564 (1990). DOI 10.1145/88616.88631. URL
21. Sussman, G.J., Steele, G.L.: Scheme: A interpreter for extended lambda calculus.
Higher-Order and Symbolic Computation 11(4), 405–439 (1998)
22. Wilbur, S., Bacarisse, B.: Building distributed systems with remote procedure call.
Software Engineering Journal 2(5), 148–159 (1987)
23. Yoo, A.B., Jette, M.A., Grondona, M.: Slurm: Simple linux utility for resource
management. In: Workshop on Job Scheduling Strategies for Parallel Processing,
pp. 44–60. Springer (2003)