Generating events with style
Matthieu Boutier(1) and Gabriel Kerneis(1,2)
(1) PPS, Université Paris Diderot, {first.last}@pps.univ-paris-diderot.fr
(2) University of Cambridge
Abstract.
Threads and events are two common abstractions for writing
concurrent programs. Because threads are often more convenient, but
events more efficient, it is natural to want to translate the former into
the latter. However, whereas there are many different event-driven styles,
existing translators often apply ad-hoc rules which do not reflect this
diversity.
We analyse various control-flow and data-flow encodings in real-world
event-driven code, and we observe that it is possible to generate any
of these styles automatically from threaded code, by applying certain
carefully chosen classical program transformations. In particular, we
implement two of these transformations, lambda lifting and environments,
in CPC, an extension of the C language for writing concurrent systems.
Finally, we find out that, although rarely used in real-world programs
because it is tedious to perform manually, lambda lifting yields better
performance than environments in most of our benchmarks.
Keywords: Concurrency, program transformations, event-driven style
1 Introduction
Most computer programs are concurrent programs, which need to perform several
tasks at the same time. For example, a network server needs to serve multiple
clients at a time; a GUI needs to handle multiple keyboard and mouse inputs;
and a network program with a graphical interface (e.g. a web browser) needs to
do both simultaneously.
Translating threads into events
There are many different techniques to
implement concurrent programs. A very common abstraction is provided by
threads, or lightweight processes. In a threaded program, concurrent tasks are
executed by a number of independent threads which communicate through a
shared memory heap. An alternative to threads is event-driven programming.
An event-driven program interacts with its environment by reacting to a set of
stimuli called events. At any given point in time, to every event is associated a
piece of code known as the handler for this event. A global scheduler, known as
the event loop, repeatedly waits for an event to occur and invokes the associated
handler. Performing a complex task requires coordinating several event handlers
by exchanging appropriate events.
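To make this pattern concrete, here is a minimal sketch of such an event loop in C; the event structure, wait_for_event primitive and handler_table are hypothetical stand-ins for a real event loop's internals, not a particular library's API.

    typedef void (*handler_fn)(void *data);

    struct event {
        int id;        /* which stimulus occurred */
        void *data;    /* state registered with the handler */
    };

    extern struct event wait_for_event(void);  /* blocks until a stimulus */
    extern handler_fn handler_table[];         /* handler associated to each event id */

    void event_loop(void) {
        for (;;) {
            struct event e = wait_for_event();
            handler_table[e.id](e.data);       /* invoke the associated handler */
        }
    }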
Unlike threads, event handlers do not have an associated stack; event-driven
programs are therefore more lightweight and often faster than their threaded
counterparts. However, because it splits the flow of control into multiple tiny
event handlers, event-driven programming is generally deemed more difficult and
error-prone. Additionally, event-driven programming alone is often not powerful
enough, in particular when accessing blocking APIs or using multiple processor
cores; it is then necessary to write hybrid code, that uses both preemptively-
scheduled threads and cooperatively-scheduled event handlers, which is even more
difficult.
Since event-driven programming is more difficult but more efficient than
threaded programming, it is natural to want to at least partially automate
it. Continuation-Passing C (CPC [10]) is an extension of the C programming
language for writing concurrent systems. The CPC programmer manipulates
very lightweight threads, choosing whether they should be cooperatively or
preemptively scheduled at any given point. The CPC program is then processed
by the CPC translator, which produces highly efficient sequentialised event-loop
code, and uses native threads to execute the preemptive parts. The translation
from threads into events is performed by a series of classical source-to-source
program transformations: splitting of the control flow into mutually recursive
inner functions, lambda lifting of these functions created by the splitting pass,
and CPS conversion of the resulting code. This approach retains the best of
both worlds: the relative convenience of programming with threads, and the low
memory usage of event-loop code.
The many styles of events
Not all event-driven programs look the same:
several styles and implementations exist, depending on the programmer’s taste.
Since event-driven programming consists in manually handling the control flow
and data flow of each task, a tedious and error-prone activity, the programmer
often chooses a style based on some trade-off between (his intuition of) efficiency
and code readability, and then sticks with it throughout the program. Even if
the representation of control or data turns out to be suboptimal, changing it
would generally require a complete refactoring of the program, not likely to be
undertaken for an uncertain performance gain. In large event-driven programs,
written by several people or over a long timespan, it is even possible to find a
mix of several styles making the code even harder to decipher.
For example, the transformations performed by the CPC translator yield
event-driven code where control flow is encoded as long, intricate chains of
callbacks, and where local state is stored in tiny data structures, repeatedly
copied from one event-handler to the next. We can afford these techniques
because we generate the code automatically. Hand-written programs often use
less tedious approaches, such as state machines to encode control flow and coarse
long-lived data structures to store local state; these are easier to understand
and debug but might be less efficient. Since the transformations performed by
the CPC translator are completely automated, it offers an ideal opportunity
to generate several event-driven variants of the same threaded program, and
compare their efficiency.
Contributions
We first review existing translators from threads to events
(Section 2), and analyse several examples of event-driven styles found in real-
world programs (Section 3). We identify two typical kinds of control-flow and
data-flow encodings: callbacks or state machines for the control flow, and coarse-
grained or minimal data structures for the data flow.
We then propose a set of automatic program transformations to produce each
of these four variants (Section 4). Control flow is translated by splitting and CPS
conversion to produce callbacks; adding a pass of defunctionalisation yields state
machines. Data flow is translated either by lambda lifting, to produce minimal,
short-lived data structures, or using shared environments for coarse-grained ones.
Finally, we implement eCPC, a variant of CPC using shared environments
instead of lambda lifting to handle the data flow in the generated programs
(Section 5). We find out that, although rarely used in real-world event-driven
programs because it is tedious to perform manually, lambda lifting yields faster
code than environments in most of our benchmarks. To the best of our knowledge,
CPC is currently the only threads-to-events translator using lambda lifting.
2 Related work
The translation of threads into events has been rediscovered many times [5,11,12].
In this section, we review existing solutions, and observe that each of them gener-
ates only one particular kind of event-driven style. As we shall see in Section 4, we
believe that these implementations are in fact a few classical transformation tech-
niques, studied extensively in the context of functional languages, and adapted to
imperative languages, sometimes unknowingly, by programmers trying to solve
the issue of writing events in a threaded style.
The first example known to us is Weave, an unpublished tool used at IBM
in the late 1990’s to write firmware and drivers for SSA-SCSI RAID storage
adapters [11]. It translates annotated Woven-C code, written in threaded style,
into C code hooked into the underlying event-driven kernel.
Adya et al. [1] provide a detailed analysis of control flow in threads and events
programs, and implement adaptors between event-driven and threaded code to
write hybrid programs mixing both styles.
Duff introduces a technique, known as Duff’s device [4], to express general loop
unrolling directly in C, using the switch statement. Much later, this technique
has been employed multiple times to express state machines and event-driven
programs in a threaded style: protothreads [5], FairThreads’ automata [2]. These
libraries help keep a clearer flow of control but they provide no automatic handling
of data flow: the programmer is expected to save local variables manually in his
own data structures, just like in event-driven style.
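To illustrate the technique these libraries rely on, here is a minimal protothread-like coroutine built on a switch over saved line numbers; the PT_* macros are a simplified sketch, not the actual protothreads API. Note how the local counter i must be kept outside the function, exactly as the text observes.

    #include <stdio.h>

    struct pt { int line; };              /* saved "program counter" */

    #define PT_BEGIN(pt)  switch ((pt)->line) { case 0:
    #define PT_YIELD(pt)  do { (pt)->line = __LINE__; return 0; \
                               case __LINE__:; } while (0)
    #define PT_END(pt)    } (pt)->line = 0; return 1

    static int counter(struct pt *pt, int *i) {
        PT_BEGIN(pt);
        for (*i = 0; *i < 3; (*i)++) {
            printf("step %d\n", *i);
            PT_YIELD(pt);                 /* return now, resume here next call */
        }
        PT_END(pt);
    }

    int main(void) {
        struct pt pt = { 0 };
        int i;
        while (!counter(&pt, &i))         /* drive the coroutine to completion */
            ;
        return 0;
    }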
Tame [12] is a C++ language extension and library which exposes events
to the programmer but does not impose event-driven style: it generates state
machines to avoid the stack ripping issue and retain a thread-like feeling. Similarly
to Weave, the programmer needs to annotate local variables that must be saved
across context switches.
TaskJava [6] implements the same idea as Tame, in Java, but preserves local
variables automatically, storing them in a state record. Kilim [17] is a message-
passing framework for Java providing actor-based, lightweight threads. It is also
implemented by a partial CPS conversion performed on annotated functions, but
contrary to TaskJava, it works at the JVM bytecode level.
MapJAX [13] is a conservative extension of Javascript for writing asynchronous
RPC, compiled to plain Javascript using some kind of ad-hoc splitting and CPS
conversion. Interestingly enough, the authors note that, in spite of Javascript’s
support for nested functions, they need to perform “function denesting” for
performance reasons; they store free variables in environments (“closure objects”)
rather than using lambda lifting.
AC [7] is a set of language constructs for composable asynchronous I/O
in C and C++. Harris et al. introduce do..finish and async operators to
write asynchronous requests in a synchronous style, and give an operational
semantics. The language constructs are somewhat similar to those of Tame but
the implementation is very different, using LLVM code blocks or macros based
on GCC’s nested functions rather than source-to-source transformations.
3 Control flow and data flow in event-driven code
Because event-driven programs do not use the native call stack to store return
addresses and local variables, they must encode the control flow and data flow
in data structures, the bookkeeping of which is the programmer’s responsibility.
This yields a diversity of styles among event-driven programs, depending on the
programmer’s taste, creativity, and his perception of efficiency. In this section,
we analyse how control flow and data flow are encoded in several examples of
real-world event-driven programs, and compare them to equivalent threaded-style
programs.
3.1 Control flow
Two main techniques are used to represent the control flow in event-driven
programming: callbacks and state machines.
Callbacks Most of the time, control flow is implemented with callbacks. Instead
of performing a blocking function call, the programmer calls a non-blocking
equivalent that cooperates with the event loop, providing a function pointer to be
called back once the non-blocking call is done. This callback function is actually
the continuation of the blocking operation.
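The following sketch contrasts the two styles; async_read and its registration signature are hypothetical stand-ins for a non-blocking API, not a specific library.

    #include <unistd.h>

    struct conn { int fd; char buf[512]; };

    extern void handle(char *buf, ssize_t n);

    /* Threaded style: the return address on the stack is the continuation. */
    void serve_threaded(struct conn *c) {
        ssize_t n = read(c->fd, c->buf, sizeof(c->buf));  /* blocks */
        handle(c->buf, n);                                /* falls through */
    }

    /* Event-driven style: the continuation is passed explicitly as a
       callback; async_read is a hypothetical non-blocking equivalent that
       registers on_read with the event loop and returns immediately. */
    extern void async_read(int fd, char *buf, size_t len,
                           void (*cb)(void *data, ssize_t n), void *data);

    void on_read(void *data, ssize_t n) {   /* the continuation of read() */
        struct conn *c = data;
        handle(c->buf, n);
    }

    void serve_event_driven(struct conn *c) {
        async_read(c->fd, c->buf, sizeof(c->buf), on_read, c);
    }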
Developing large programs raises the issue of composing event handlers.
Whereas threaded code has return addresses stored on the stack and a standard
calling sequence to coordinate the caller and the callee, event-driven code needs to
define its own strategy to layer callbacks, storing the callback to the next layer in
some data structure associated with the event handler. The “continuation stack”
of callbacks is often split in various places of the code, each callback encoding its
chunk of the stack in an ad-hoc manner.
Consider for instance the accept loop of an HTTP server that accepts clients
and starts two tasks for each of them: a client handler, and a timeout to disconnect
idle clients. With cooperative threads, this would be implemented as a mere
infinite loop with a cooperation point. The following code is an example of such
an accept loop written with CPC.
    cps int cpc_accept(int fd) {
        cpc_io_wait(fd, CPC_IO_IN);
        return accept(fd, NULL, NULL);
    }

    cps int accept_loop(int fd) {
        int client_fd;
        while (1) {
            client_fd = cpc_accept(fd);
            cpc_spawn httpTimeout(client_fd, clientTimeout);
            cpc_spawn httpClientHandler(client_fd);
        }
    }
The programmer calls cpc_spawn accept_loop(fd) to create a new thread that
runs the accept loop; the function accept_loop then waits for incoming
connections with the cooperating primitive cpc_io_wait, and creates two new
threads for each client (httpTimeout and httpClientHandler), which kill each
other upon completion. Note that cooperative functions are annotated with the
cps keyword; such cps functions are to be converted into event-driven style by
the CPC translator.
Figure 1 shows the (very simplified) code of the accepting loop in Polipo,
a caching web-proxy written by Chroboczek
(http://www.pps.univ-paris-diderot.fr/~jch/software/polipo/). This code is
equivalent to the threaded version above, and uses several levels of callbacks.
In Polipo, the accept loop is started by a call to schedule_accept(fd,
httpAccept, NULL). This function stores the pointer to the (second-level)
callback httpAccept in the handler field of the request data structure
(line 10), and registers a (first-level) callback to do_scheduled_accept,
through registerFdEvent. Each time the file descriptor fd becomes ready (not
shown), the event loop calls the (first-level) callback do_scheduled_accept,
which performs the actual accept system call (line 23) and finally invokes the
(second-level) callback httpAccept stored in request->handler (line 24).
This callback schedules two new event handlers, httpTimeout and
httpClientHandler. The former is a timeout handler, registered by
scheduleTimeEvent (line 35); the latter reacts to I/O events to read requests
from the client, and is registered by do_stream_buf (line 41). Note that those
helper functions that register callbacks with the event loop use other
intermediary callbacks themselves, just like schedule_accept uses
do_scheduled_accept.
     1  FdEventHandlerPtr
     2  schedule_accept(int fd,
     3      int (*handler)(int, FdEventHandlerPtr, AcceptRequestPtr),
     4      void *data) {
     5      FdEventHandlerPtr event;
     6      AcceptRequestRec request;
     7      int done;
     8
     9      request.fd = fd;
    10      request.handler = handler;
    11      request.data = data;
    12      event = registerFdEvent(fd, POLLOUT | POLLIN,
    13                              do_scheduled_accept,
    14                              sizeof(request), &request);
    15      return event;
    16  }
    17
    18  int
    19  do_scheduled_accept(int status, FdEventHandlerPtr event) {
    20      AcceptRequestPtr request = (AcceptRequestPtr)&event->data;
    21      int rc, done;
    22
    23      rc = accept(request->fd, NULL, NULL);
    24      done = request->handler(rc, event, request);
    25      return done;
    26  }
    27
    28  int
    29  httpAccept(int fd, FdEventHandlerPtr event,
    30             AcceptRequestPtr request) {
    31      HTTPConnectionPtr connection;
    32      TimeEventHandlerPtr timeout;
    33
    34      connection = httpMakeConnection();
    35      timeout = scheduleTimeEvent(clientTimeout,
    36                                  httpTimeoutHandler,
    37                                  sizeof(connection), &connection);
    38      connection->fd = fd;
    39      connection->timeout = timeout;
    40      connection->flags = CONN_READER;
    41      do_stream_buf(IO_READ | IO_NOTNOW,
    42                    connection->fd, 0, &connection->reqbuf,
    43                    CHUNK_SIZE, httpClientHandler, connection);
    44      return 0;
    45  }
Fig. 1. Accept loop callbacks in Polipo (simplified)
In the original Polipo code, things are even more complex since
schedule_accept is called from httpAcceptAgain, yet another callback that is
registered by httpAccept itself in some error cases. The control flow becomes
very hard to follow, in particular when errors are triggered: each callback
must be prepared to cope with error codes, or to forward the unexpected value
to the next layer. In some parts of the code, this style looks a lot like an
error monad manually interleaved with a continuation monad. Without a strict
discipline and well-defined conventions about composition, the flexibility of
callbacks easily traps the programmer in a control-flow and storage-allocation
maze.
State machines When the multiplication of callbacks becomes unbearable, the
event-loop programmer might refactor his code to use a state machine. Instead
of splitting a computation into as many callbacks as it has atomic steps, the
programmer registers a single callback that will be called over and over until the
computation is done. This callback implements a state machine: it stores the
current state of the computation into an ad-hoc data structure, just like threaded
code would store the program counter, and uses it upon resuming to jump to the
appropriate location.
Figure 2 shows how the initial handshake of a BitTorrent connection is handled
in Transmission (http://www.transmissionbt.com/), a popular and efficient
BitTorrent client written in (mostly) event-driven style. Until the handshake
is over, all data arriving from a peer is handed over by the event loop to the
canRead callback. This function implements a state machine, whose state is
stored in the state field of a handshake data structure. This field is
initialised to AWAITING_HANDSHAKE when the connection is established (not
shown) and updated by the functions responsible for each step of the handshake.
The first part of the handshake is dispatched by canRead to the readHandshake
function (line 7). It receives the buffer inbuf containing the bytes received
so far; if not enough data has yet been received to carry on the handshake, it
returns READ_LATER to canRead (line 26), which forwards it to the event loop
to be called back when more data is available (line 16). Otherwise, it checks
the BitTorrent header (line 28), parses the first part of the handshake,
registers a callback to send a reply handshake (not shown), and finally
updates the state (line 33) and returns READ_NOW to indicate that the rest of
the handshake should be processed immediately (line 34).
Note what happens when the BitTorrent header is wrong (line 28): the function
tr_handshakeDone is called with false as its second parameter, indicating that
some error occurred. This function (not shown) is responsible for invoking the
callback handshake->doneCB and then deallocating the handshake structure.
This is another example of the multiple layers of callbacks mentioned above.
If the first part of the handshake completes without error, canRead then
dispatches the buffer to readPeerId which completes the handshake (line 10).
Just like readHandshake, it returns READ_LATER if the second part of the
handshake has not arrived yet (line 41) and finally calls tr_handshakeDone
with true to indicate that the handshake has been successfully completed
(line 45).
In the original code, ten additional states are used to deal with the various
steps of negotiating encryption keys. The last of these steps finally rolls
back the state to AWAITING_HANDSHAKE and the keys are used by the function
tr_peerIoReadBytes to decrypt the rest of the exchange transparently. The
state machine approach makes the code slightly more readable than using pure
callbacks.
     1  static ReadState
     2  canRead(struct evbuffer *inbuf, tr_handshake *handshake) {
     3      ReadState ret = READ_NOW;
     4
     5      while (ret == READ_NOW) {
     6          switch (handshake->state) {
     7          case AWAITING_HANDSHAKE:
     8              ret = readHandshake(handshake, inbuf);
     9              break;
    10          case AWAITING_PEER_ID:
    11              ret = readPeerId(handshake, inbuf);
    12              break;
    13          /* ... cases dealing with encryption omitted */
    14          }
    15      }
    16      return ret;
    17  }
    18
    19  static int
    20  readHandshake(tr_handshake *handshake,
    21                struct evbuffer *inbuf) {
    22      uint8_t pstr[20], reserved[HANDSHAKE_FLAGS_LEN],
    23              hash[SHA_DIGEST_LENGTH];
    24
    25      if (evbuffer_get_length(inbuf) < INCOMING_HANDSHAKE_LEN)
    26          return READ_LATER;
    27      tr_peerIoReadBytes(handshake->io, inbuf, pstr, 20);
    28      if (memcmp(pstr, "\023BitTorrent protocol", 20))
    29          return tr_handshakeDone(handshake, false);
    30      tr_peerIoReadBytes(handshake->io, inbuf, reserved, ...);
    31      tr_peerIoReadBytes(handshake->io, inbuf, hash, ...);
    32      /* ... parsing of handshake and sending reply omitted */
    33      handshake->state = AWAITING_PEER_ID;
    34      return READ_NOW;
    35  }
    36
    37  static int
    38  readPeerId(tr_handshake *handshake, struct evbuffer *inbuf) {
    39      uint8_t peer_id[PEER_ID_LEN];
    40
    41      if (evbuffer_get_length(inbuf) < PEER_ID_LEN)
    42          return READ_LATER;
    43      tr_peerIoReadBytes(handshake->io, inbuf, peer_id, ...);
    44      /* ... parsing of peer id omitted */
    45      return tr_handshakeDone(handshake, true);
    46  }
Fig. 2. Handshake state-machine in Transmission (simplified)
3.2 Data flow
Since each callback function performs only a small part of the whole computation,
the event-loop programmer needs to store temporary data required to carry
on the computation in heap-allocated data structures, whereas stack-allocated
variables would sometimes seem more natural in threaded style. The content of
these data structures depends heavily on the program being developed but we
can characterise some common patterns.
Event loops generally provide some means to specify a void* pointer when
registering an event handler. When the expected event triggers, the pointer is
passed as a parameter to the callback function, along with information about
the event itself. This allows the programmer to store partial results in a
structure of his choice, and recover it through the pointer without bothering
to maintain the association between event handlers and data himself.
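A minimal sketch of this convention follows; register_fd_event, struct event and the download example are hypothetical names of ours, not those of a specific event loop.

    #include <stdlib.h>

    struct event { int fd; int flags; };         /* information about the event */
    struct download { int fd; long received; };  /* the programmer's own state */

    /* Hypothetical registration primitive: the event loop keeps the void*
       and hands it back to the callback when fd becomes ready. */
    extern void register_fd_event(int fd,
                                  void (*cb)(struct event *, void *),
                                  void *data);

    static void on_ready(struct event *ev, void *data) {
        struct download *dl = data;              /* recover the partial results */
        dl->received += 1;                       /* ... carry on the computation ... */
        register_fd_event(dl->fd, on_ready, dl); /* wait for the next chunk */
    }

    void start_download(int fd) {
        struct download *dl = malloc(sizeof *dl);
        dl->fd = fd;
        dl->received = 0;
        register_fd_event(fd, on_ready, dl);
    }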
Coarse-grained, long-lived data structures These data structures are usually
large and coarse-grained. Each of them corresponds to some meaningful object
in the context of the program, and is passed from callback to callback through
a pointer. For instance, the connection structure used in Polipo (Figure 1) is
allocated by httpMakeConnection when a connection starts (line 34) and passed
to the callbacks httpTimeoutHandler and httpClientHandler through the
registering functions scheduleTimeEvent (line 35) and do_stream_buf (line 41).
It lives as long as the HTTP connection it describes and contains no less than
22 fields: fd, timeout, buf, pipelined, etc. The tr_handshake structure passed
to canRead in Transmission is similarly large, with 18 fields.
Some of these fields need to live for the whole connection (e.g. fd, which
stores the file descriptor of the socket) but others are used only transiently
(e.g. buf, which is filled only when sending a reply), or even not at all in
some cases (e.g. the structure HTTPConnectionPtr is used for both client and
server connections, but the pipelined field is never used in the client case).
Even if it wastes memory in some cases, it would be too much of a hassle for
the programmer to track every possible data flow in the program and create
ad-hoc data structures for each of them.
Minimal, short-lived data structures In some simple cases, however, the event-loop
programmer is able to allocate very small and short-lived data structures. These
minimal data structures are allocated directly within an event handler and are
deallocated when the associated callback returns. They might even be allocated
on the stack by the programmer and copied inside the event-loop internals by
the helper function registering the event handler. The overhead is therefore kept
as low as possible.
For instance, the function schedule_accept passes a tiny, stack-allocated
structure request to the helper function registerFdEvent (Fig. 1, line 12).
This structure is of type AcceptRequestRec (not shown), which contains only
three fields: an integer fd and two pointers handler and data. It is copied by
registerFdEvent in the event-loop data structure associated with the event,
and freed automatically after the callback do_scheduled_accept has returned;
it is as short-lived and (almost) as compact as possible.
As it turns out, creating truly minimal structures is hard: AcceptRequestRec
could in fact be optimised to get rid of the fields data, which is always NULL
in practice in Polipo, and fd, which is also present in the encapsulating
event data structure. Finding every such redundancy in the data flow of a
large event-driven program would be a daunting task, hence the spurious and
redundant fields used to lighten the programmer’s burden.
4 Generating various event-driven styles
In this section, we first demonstrate the effect of CPC transformation passes on
a small example; we show that code produced by the CPC translator is very
close to event-driven code using callbacks for control flow, and minimal data
structures for data flow (Section 4.1). We then show how two other classical
translation passes produce different event-driven styles: defunctionalising inner
functions yields state machines (Section 4.2), and encapsulating local variables in
shared environments yields larger, long-lived data structures with full context
(Section 4.3).
4.1 The CPC compilation technique
Consider the following function, which counts seconds down from an initial
value x to zero.
    cps void countdown(int x) {
        while (x > 0) {
            printf("%d\n", x--);
            cpc_sleep(1);
        }
        printf("time is over!\n");
    }
This function is annotated with the cps keyword to indicate that it yields to
the CPC scheduler. This is necessary because it calls the CPC primitive
cpc_sleep, which also yields to the scheduler.
The CPC translator is structured in a series of proven source-to-source
transformations [10], which turn a threaded-style CPC program into an equivalent
event-driven C program. Boxing first encapsulates a small number of variables in
environments. Splitting then splits the flow of control of each cps function into a
set of inner functions. Lambda lifting removes free local variables introduced by
the splitting step; it copies them from one inner function to the next, yielding
closed inner functions. Finally, the program is in a form simple enough to perform
a one-pass partial CPS conversion. The resulting continuations are used at
runtime to schedule threads.
In the rest of this section, we show how splitting, lambda lifting and CPS
conversion transform the function countdown. The boxing pass has no effect on
this example because it only applies to extruded variables, the address of
which is retained by the “address of” operator (&).
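For illustration, here is a hypothetical example of a variable the boxing pass would affect; g is an arbitrary function taking an int*.

    extern void g(int *);

    cps void f(void) {
        int x = 0;
        g(&x);            /* &x escapes: x is an extruded variable */
        cpc_sleep(1);     /* x must survive this cooperation point; copying it
                             during lambda lifting would invalidate the alias,
                             so the boxing pass heap-allocates it first */
        printf("%d\n", x);
    }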
Splitting The first transformation performed by the CPC translator is splitting.
Splitting was first described by van Wijngaarden for Algol 60 [20], and later
adapted by Thielecke to C, albeit in a restrictive way [19]. It translates
control structures into mutually recursive functions.
Splitting is done in two steps. The first step consists in replacing every
control-flow structure, such as for and while loops, by its equivalent in
terms of if and goto.
    cps void countdown(int x) {
      loop:
        if (x <= 0) goto timeout;
        printf("%d\n", x--);
        cpc_sleep(1);
        goto loop;
      timeout:
        printf("time is over!\n");
    }
The second step uses the fact that gotos are equivalent to tail calls [18]. It
translates every labelled block into an inner function, and every jump to that
label into a tail call (followed by a return) to that function.
    1  cps void countdown(int x) {
    2      cps void loop() {
    3          if (x <= 0) { timeout(); return; }
    4          printf("%d\n", x--);
    5          cpc_sleep(1); loop(); return;
    6      }
    7      cps void timeout() { printf("time is over!\n"); return; }
    8      loop(); return;
    9  }

Fig. 3. CPC code after splitting
Splitting yields a program where each cps function is split into several
mutually recursive, atomic functions, very similar to event handlers.
Additionally, the tail positions of these inner functions are always either:
– a return statement (for instance, on line 7 in the previous example),
– a tail call to another cps function (line 3),
– a call to an external cps function followed by a call to an inner cps
  function (line 5).
We recognise the typical patterns of an event-driven program that we studied in
Section 3: respectively, returning a value to the upper layer (Fig. 1 (4)),
calling a function to carry on the current computation (Fig. 2 (1)), or calling
a function with a callback to resume the computation once it has returned
(Fig. 1 (2)).
Another effect of splitting is the introduction of free variables, which are
bound to the original encapsulating function rather than the new inner ones.
For instance, the variable x is free in the function loop above. Because inner
functions and free variables are not allowed in C, we perform a pass of lambda
lifting to eliminate them.
Lambda lifting The CPC translator then makes the data flow explicit with a
lambda-lifting pass. Lambda lifting, also called closure conversion, is a
standard technique, introduced by Johnsson [8], to remove free variables. It
is also performed in two steps: parameter lifting and block floating.
Parameter lifting binds every free variable to the inner function where it
appears (for instance x to loop on line 2 below). The variable is also added
as a parameter at every call point of the function (lines 5 and 8).
    1  cps void countdown(int x) {
    2      cps void loop(int x) {
    3          if (x <= 0) { timeout(); return; }
    4          printf("%d\n", x--);
    5          cpc_sleep(1); loop(x); return;
    6      }
    7      cps void timeout() { printf("time is over!\n"); return; }
    8      loop(x); return;
    9  }
Note that because C is a call-by-value language, lifted parameters are
duplicated rather than shared, and this step is not correct in general. It is
however sound in the case of CPC because lifted functions are called in tail
position: they never return, which guarantees that at most one copy of each
parameter is reachable at any given time [10]. Block floating is then a
trivial extraction of the now-closed inner functions to top level.
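For our example, block floating hoists the closed inner functions to top level:

    cps void loop(int x) {
        if (x <= 0) { timeout(); return; }
        printf("%d\n", x--);
        cpc_sleep(1); loop(x); return;
    }
    cps void timeout() { printf("time is over!\n"); return; }
    cps void countdown(int x) { loop(x); return; }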
Lambda lifting yields a program where the data is copied from function to
function, each copy living as long as the associated handler. If some piece of
data is no longer needed during the computation, it will not be copied into
the subsequent handlers; for instance, the variable x is not passed to the
function timeout. Hence, lambda lifting produces short-lived, almost minimal
data structures.
CPS conversion Finally, the control flow is made explicit with a CPS
conversion [14, 16]. The continuations store callbacks and their parameters in
a regular stack-like structure cont with two primitive operations: push to add
a function on the continuation, and invoke to call the first function of the
continuation.
    cps void loop(int x, cont *k) {
        if (x <= 0) { timeout(k); return; }
        printf("%d\n", x--);
        cpc_sleep(1, push(loop, x, k)); return;
    }
    cps void timeout(cont *k) {
        printf("time is over!\n");
        invoke(k); return;
    }
    cps void countdown(int x, cont *k) { loop(x, k); return; }
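For concreteness, here is a minimal sketch of what such a cont structure and its push and invoke primitives could look like, simplified to a single int parameter per frame; the actual CPC runtime representation differs.

    #include <stdlib.h>

    typedef struct cont cont;
    typedef void (*cps_fn)(int arg, cont *k);

    struct cont {
        cps_fn fn;      /* callback to run next */
        int arg;        /* its lifted parameter */
        cont *next;     /* rest of the continuation */
    };

    cont *push(cps_fn fn, int arg, cont *k) {
        cont *c = malloc(sizeof *c);
        c->fn = fn; c->arg = arg; c->next = k;
        return c;
    }

    void invoke(cont *k) {      /* assumes a non-empty continuation */
        cont c = *k;            /* pop the first frame... */
        free(k);
        c.fn(c.arg, c.next);    /* ...and call it with the rest */
    }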
CPS conversion turns out to be an efficient and systematic implementation of
the layered callback scheme described in Section 3.1. Note that, just like
lambda lifting, CPS conversion is not correct in general in an imperative
call-by-value language, because of duplicated variables on the continuation.
It is however correct in the case of CPC, for reasons similar to the
correctness of lambda lifting [10].
4.2 Defunctionalising inner functions
Defunctionalisation is a compilation technique introduced by Reynolds to translate
higher-order programs into first-order ones [15]. It maps every first-class function
to a first-order structure that contains both an index representing the function,
and the values of its free variables. Each such structure is usually a
constructor whose parameters store the free variables. Function calls are then
performed by
a dedicated function that dispatches on the constructor, restores the content of
the free variables and executes the code of the relevant function.
The dispatch function introduced by defunctionalisation is very close to a state
automaton. It is therefore not surprising that defunctionalising inner functions
in CPC yields an event-driven style similar to state machines (Section 3.1).
Defunctionalisation of CPC programs Usually, defunctionalisation contains an
implicit lambda-lifting pass, to make free variables explicit and store them
in constructors. For example, a function fn x => x + y would be replaced by an
instance of LAMBDA of int, with the free variable y copied in the constructor
LAMBDA. The dispatch function would then have a case:
dispatch (LAMBDA y, x) = x + y.
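Transposed to C, the same example could look like the following sketch (the names are illustrative):

    /* Defunctionalising fn x => x + y: the closure becomes a tagged
       record carrying its free variable, and dispatch interprets the tag. */
    enum fn_tag { LAMBDA };             /* one constructor per function */

    struct fn {
        enum fn_tag tag;
        int y;                          /* free variable of the LAMBDA case */
    };

    int dispatch(struct fn f, int x) {
        switch (f.tag) {
        case LAMBDA: return x + f.y;    /* dispatch (LAMBDA y, x) = x + y */
        }
        return 0;                       /* unreachable */
    }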
In this discussion, we wish to decouple this data-flow transformation from
the translation of the control flow into a state machine. Therefore, we define the
dispatch function as an inner function which merges the content of the other inner
functions but still contains free variables. This is possible because the splitting
pass does not create any closure: it introduces inner functions with free variables,
but these are always called directly, not stored as first-class values whose free
variables must be captured.
Consider again our countdown example after the splitting pass (Fig. 3). Once
defunctionalised, it contains a single inner function dispatch that dispatches
on an enumeration representing the former inner functions loop and timeout.
     1  enum state { LOOP, TIMEOUT };
     2  cps void countdown(int x) {
     3      cps void dispatch(enum state s) {
     4          switch (s) {
     5          case LOOP:
     6              if (x <= 0) { dispatch(TIMEOUT); return; }
     7              printf("%d\n", x--);
     8              cpc_sleep(1); dispatch(LOOP); return;
     9          case TIMEOUT:
    10              printf("time is over!\n"); return;
    11          }
    12      }
    13      dispatch(LOOP); return;
    14  }
As an optimisation, the recursive call to dispatch on line 6 can be replaced
by a goto statement. However, we cannot replace the call that follows the cps
function cpc_sleep(1) on line 8, since we will need to provide dispatch as a
callback to cpc_sleep during CPS conversion, to avoid blocking.
We must then eliminate free variables and inner functions, with a
lambda-lifting pass. It is still correct because defunctionalisation does not
break the required invariants on tail calls. We finally reach code that is
similar in style to the state machine shown in Fig. 2.
    cps void dispatch(enum state s, int x) {
        switch (s) {
        case LOOP:
            if (x <= 0) goto timeout_label;
            printf("%d\n", x--);
            cpc_sleep(1); dispatch(LOOP, x); return;
        case TIMEOUT: timeout_label:
            printf("time is over!\n"); return;
        }
    }
    cps void countdown(int x) { dispatch(LOOP, x); return; }
In this example, we have also replaced the first occurrence of dispatch with
goto timeout_label, as discussed above, which avoids the final function call
when the counter reaches zero.
If we ignore the switch – which serves mainly as an entry point to the
dispatch function, à la Duff’s device [4] – we recognise the intermediate code
generated during the first step of splitting, as having an explicit control
flow using gotos but without inner functions. In retrospect, the second step
of splitting, which translates gotos to inner functions, can be considered as
a form of refunctionalisation, the left-inverse of defunctionalisation [3].
Benefits The translation presented here is in fact a partial
defunctionalisation: each cps function in the original program gets its own
dispatch function, and only inner functions are defunctionalised. A global
defunctionalisation would imply a whole-program analysis, would break modular
compilation, and would probably not be very efficient because C compilers are
optimised to compile hand-written, reasonably-sized functions rather than a
giant state automaton with hundreds of states. On the other hand, since it is
only partial, this translation does not eliminate the need for a subsequent
CPS conversion step to translate calls to external cps functions into
operations on continuations.
Despite adding a new translation step while keeping the final CPS conversion,
this approach has several advantages over the CPS conversion of many smaller,
mutually recursive functions performed by the current CPC translator. First,
we do not pay the cost of a CPS call for inner functions. This might bring
significant speed-ups in the case of tight loops or complex control flows.
Moreover, it leaves many more optimisation opportunities for the C compiler,
for instance to store certain variables in registers, and reduces the number
of operations on the continuations. It also makes debugging easier, avoiding
numerous hops through ancillary cps functions.
4.3 Shared environments
The two main compilation techniques to handle free variables are lambda lifting,
illustrated in Section 4.1 and discussed extensively in a previous article [10], and
environments. An environment is a data structure used to capture every free
variable of a first-class function when it is defined; when the function is later
applied, it accesses its variables through the environment. Environments add
a layer of indirection, but contrary to lambda lifting they do not require free
variables to be copied on every function call.
In most functional languages, each environment represents the free variables
of a single function; a pair of a function pointer and its environment is called
a closure. However, nothing prevents in principle an environment from being
shared between several functions, provided they have the same free variables. We
use this technique to allocate a single environment shared by inner functions,
containing all local variables and function parameters.
An example of shared environments Consider once again our countdown example
after splitting (Fig. 3). We introduce an environment to contain the local
variables of countdown (here, there is only x).
     1  struct env_countdown { int x; };
     2  cps void countdown(int x) {
     3      struct env_countdown *e =
     4          malloc(sizeof(struct env_countdown));
     5      e->x = x;
     6      cps void loop(struct env_countdown *e) {
     7          if (e->x <= 0) { timeout(e); return; }
     8          printf("%d\n", e->x--);
     9          cpc_sleep(1); loop(e); return;
    10      }
    11      cps void timeout(struct env_countdown *e) {
    12          printf("time is over!\n");
    13          free(e); return;
    14      }
    15      loop(e); return;
    16  }
The environment is allocated (line 4) and initialised (line 5) when the
function countdown is entered. The inner functions access x through the
environment, either to read it (line 7) or to write it (line 8). A pointer to
the environment is passed from function to function (line 9); hence all inner
functions share the same environment. Finally, the environment is deallocated
just before the last inner function exits (line 13).
The resulting code is similar in style to hand-written event-driven code, with
a single, heap-allocated data structure sharing the local state between a set
of callbacks. Note that inner functions have no remaining free variables and
can therefore be lambda-lifted trivially.
Benefits Encapsulating local variables in environments avoids having to copy
them back and forth between the continuation and the native call stack.
However, it does not necessarily mean that the generated programs are faster;
in fact, lambda-lifted programs are often more efficient (Section 5). Another
advantage of environments is that they make programs easier to debug, because
the local state is always fully available, whereas in a lambda-lifted program
“useless” variables are discarded as soon as possible, even though they might
be useful to understand what went wrong before the program crashed.
5 Evaluation
In this section, we describe the implementation of eCPC, a CPC variant using
shared environments instead of lambda lifting to encapsulate the local state
of cps functions. We then compare the efficiency of programs generated with
eCPC and CPC, and show that the latter is more efficient in most cases. This
demonstrates the benefits of generating events automatically: most real-world
event-driven programs are based on environments, because they are much easier
to use, although systematic lambda lifting would probably yield faster code.
5.1 Implementation
The implementation of eCPC is designed to reuse as much of the existing CPC
infrastructure as possible. The eCPC translator introduces two new passes:
preparation and generation of environments. The former replaces the boxing pass;
the latter replaces lambda lifting.
Environment preparation Environments must be introduced before the splitting
pass for two reasons. First, it is easier to identify the exit points of cps
functions, where the environments must be deallocated, before they are split
into multiple, mutually recursive, inner functions. Furthermore, these
environment deallocations occur in tail position, and therefore have an impact
on the splitting pass itself.
Although deallocation points are introduced before splitting, neither
allocation nor initialisation nor indirect memory accesses are performed at
this stage. Environments introduced during this preparatory pass are empty
shells, of type void*, that merely serve to mark the deallocation points. This
is necessary because not all temporary variables have been introduced at this
stage (the splitting pass will generate more of them). Deciding which
variables will be stored in environments is delayed to a later pass.
This preparatory pass also needs to modify how return values are handled. In
the original CPC, return values are written directly in the continuation when
the returning function invokes its continuation. This is made possible by the
convention that the return value of a cps function is the last parameter of
its continuation, hence at a fixed position on the continuation stack. Such is
not the case in eCPC, where function parameters are kept in the environment
rather than copied on the continuation.
In eCPC, the caller function passes a pointer to the callee, indicating the
address where the callee must write its return value. (Note that a similar
device would be necessary to implement defunctionalisation, because the
dispatch function is a generic callback which might receive many different
types of return values.) The preparatory pass transforms every cps function
returning a type T (different from void) into a function returning void with
an additional parameter of type T; call and return points are modified
accordingly. The implementation of CPC primitives in the CPC runtime is also
modified to reflect this change.
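As an illustration, taking the cpc_accept function from Section 3.1, the eCPC convention described above would transform it along these lines; the cpc_ret parameter name is ours and the exact generated code differs.

    /* Before: a cps function returning int (CPC style). */
    cps int cpc_accept(int fd) {
        cpc_io_wait(fd, CPC_IO_IN);
        return accept(fd, NULL, NULL);
    }

    /* After the eCPC preparatory pass (sketch): the function returns void
       and writes its result through a caller-supplied pointer. */
    cps void cpc_accept(int fd, int *cpc_ret) {
        cpc_io_wait(fd, CPC_IO_IN);
        *cpc_ret = accept(fd, NULL, NULL);
        return;
    }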
Environment generation After the splitting pass, the eCPC translator allocates
and initialises environments, and replaces variables by their counterpart in the
environment.
First, it collects local variables (except the environment pointer itself) and
function parameters and generates the layout of the associated environment.
Then, it allocates the environment and initialises the fields corresponding to the
function parameters. Because this initialisation is done at the very beginning of
the translated function, it does not affect the tails, thus preserving the correctness
of CPS conversion. Finally, every use of variables is replaced by its counterpart in
the environment, local variables are discarded, and inner functions are modified
to receive the environment as a parameter instead.
The CPS conversion is kept unchanged: the issue of return values is dealt with
completely in the preparatory pass, and every cps function returns void at
this stage.
5.2 Benchmark results
We previously designed a set of benchmarks to compare CPC to other thread
libraries, and have shown that CPC is as fast as the fastest thread libraries
available to us while providing at least an order of magnitude more threads [10].
We reuse these benchmarks here to compare the speed of CPC and eCPC; our
experimental setup is unchanged, and detailed in our previous work.
Primitive operations We first measure the time of individual CPC primitives.
Table 1 shows the relative speed of eCPC compared with CPC for each of our
micro-benchmarks: t_eCPC / t_CPC. A value greater than 1 indicates that eCPC
is slower than CPC. The slowest primitive operation in CPC is a cps function
call (cps-call), mostly because of the multiple layers of indirection
introduced by continuations. This overhead is even larger in the case of eCPC:
performing a cps function call is 2 to 3 times slower than with CPC.
Table 1. Ratio of speeds of eCPC to CPC

    Architecture           cps-call   switch   condvar   spawn
    Core 2 Duo (x86-64)    2.45       1.67     1.13      2.18
    Pentium M (x86)        2.35       1.75     1.08      3.12
    MIPS-32 4KEc           2.92       1.43     0.91      1.59

This difference of cost for cps function calls probably has an impact on the
other benchmarks, making them more difficult to interpret. Context switches
(switch) are around 50 % slower on every architecture, which is surprisingly
high since they involve almost no manipulation of environments. Thread
creation (spawn) varies a lot across architectures: more than 3 times slower
on the Pentium M, but only 59 % slower on a MIPS embedded processor. Finally,
condition variables (condvar) are even more surprising: not much slower on x86
and x86-64, and even 9 % faster on MIPS. It is unclear which combination of
factors leads eCPC to outperform CPC on this particular benchmark only: we
believe that the larger number of registers helps to limit the number of
memory accesses, but we were not able to quantify this effect precisely.
These benchmarks of CPC primitives show that the allocation of environments
slows down eCPC in most cases, and confirm our intuition that avoiding boxing
as much as possible in favour of lambda lifting is very important in CPC.
Tic-tac-toe generator Unfortunately, benchmarking individual CPC primitives
gives little information on the performance of a whole program, because their
cost might be negligible compared to other operations. To get a better
understanding of the performance behaviour of eCPC, we wrote a trivial but
highly concurrent program with intensive memory operations: a tic-tac-toe
generator that explores the space of grids. It creates three threads at each
step, each one receiving a copy of the current grid, hence 3^9 = 19 683
threads and as many copies of the grid.
We implemented two variants of the code, to test different schemes of memory
usage. The former is a manual scheme that allocates copies of the grids with
malloc before creating the associated threads, and frees each of them in one
of the “leaf” threads, once the grid is completed (a sketch follows below).
The latter is an automatic scheme that declares the grids as local variables
and synchronises their deallocation with barriers; the grids are then
automatically encapsulated, either by the boxing pass (for CPC) or in the
environment (for eCPC).
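Here is a sketch of the manual scheme, in CPC style; grid_t, play_move and the exact recursion are hypothetical reconstructions of the benchmark, not its actual source.

    typedef struct { char cells[9]; } grid_t;   /* hypothetical grid type */
    extern void play_move(grid_t *g, int branch);

    /* Manual scheme: each child receives a malloc'ed copy of the grid and
       the "leaf" threads free the completed grids. */
    cps void explore(grid_t *g, int depth) {
        if (depth == 9) { free(g); return; }    /* grid completed */
        for (int i = 0; i < 3; i++) {
            grid_t *copy = malloc(sizeof *copy);
            memcpy(copy, g, sizeof *copy);
            play_move(copy, i);
            cpc_spawn explore(copy, depth + 1); /* three threads per step */
        }
        free(g);
    }

In the automatic variant, the grid is instead an ordinary local variable and a barrier delays its deallocation until the child threads are done, so the translator must keep it alive across cpc_spawn, either boxed (CPC) or in the shared environment (eCPC).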
Our experiment consists in launching an increasing number of generator tasks
simultaneously, each one generating the 19 683 grids and threads mentioned
above. We run up to 100 tasks simultaneously, i.e. almost 2 000 000 CPC
threads in total, and the slowest benchmark takes around 3 seconds to complete
on an Intel Centrino 1.87 GHz, downclocked to 800 MHz.
Finally, we compute the mean time per tic-tac-toe task. This ratio turns out
to be independent of the number of simultaneous tasks: both CPC and eCPC scale
linearly in this benchmark. We measured that eCPC is 20 % slower than CPC in
the case of manual allocation (13.2 vs. 11.0 ms per task), and 18 % slower in
the automatic case (31.3 vs. 26.5 ms per task). This benchmark confirms that
environments add a significant overhead in programs performing a lot of memory
accesses, although it is not as important as in benchmarks of CPC primitives.
Web servers To evaluate the impact of environments on more realistic programs,
we reuse our web server benchmark [9]. We measure the mean response time of a
small web server under the load of an increasing number of simultaneous
clients. The server is deliberately kept minimal, and uses one CPC thread per
client. The results are shown in Fig. 4. In this benchmark, the web server
compiled with eCPC is 12 % slower than the server compiled with CPC. Even on
programs that spend most of their time performing network I/O, the overhead of
environments remains measurable.
[Figure: mean time per request (ms) against the number of simultaneous clients
(0 to 1000); eCPC (slope: 0.205) grows faster than CPC (slope: 0.183).]

Fig. 4. Web server benchmark
6 Conclusions
Through the analysis of real-world programs, we have identified several typical
styles of control-flow and data-flow encodings in event-driven programs: callbacks
or state machines for the control flow, and coarse-grained or minimal data
structures for the data flow. We have then shown how these various styles can be
generated from a common threaded description, by a set of automatic program
transformations. Finally, we have implemented eCPC, a variant of the CPC
translator using shared environments instead of lambda lifting. We have found
out that, although rarely used in real-world programs because it is tedious to
perform manually, lambda lifting yields better performance than environments
in most of our benchmarks.
An interesting extension of our work would be to try and reverse our program
transformations, in order to reconstruct threaded code from event-driven pro-
grams. This could help in analysing and debugging event-driven code, or migrating
legacy, hard-to-maintain event-driven programs like Polipo towards CPC or other
cooperative threads implementations.
Acknowledgments The authors would like to thank Andy Key, who has inspired
this work, Juliusz Chroboczek for his continuous support, and Peter Sewell,
Chloé Azencott and Alan Mycroft for their valuable comments.
References

1. Adya, A., Howell, J., Theimer, M., Bolosky, W.J., Douceur, J.R.: Cooperative task management without manual stack management. In: Ellis, C.S. (ed.) USENIX Annual Technical Conference, General Track. pp. 289–302. USENIX (2002)
2. Boussinot, F.: FairThreads: mixing cooperative and preemptive threads in C. Concurrency and Computation: Practice and Experience 18(5), 445–469 (2006)
3. Danvy, O., Nielsen, L.R.: Defunctionalization at work. In: PPDP. pp. 162–174. ACM (2001)
4. Duff, T.: Duff’s device (Nov 1983), electronic mail to R. Gomes, D. M. Ritchie and R. Pike
5. Dunkels, A., Schmidt, O., Voigt, T., Ali, M.: Protothreads: simplifying event-driven programming of memory-constrained embedded systems. In: Campbell, A.T., Bonnet, P., Heidemann, J.S. (eds.) SenSys. pp. 29–42. ACM (2006)
6. Fischer, J., Majumdar, R., Millstein, T.D.: Tasks: language support for event-driven programming. In: Ramalingam, G., Visser, E. (eds.) PEPM. pp. 134–143. ACM (2007)
7. Harris, T., Abadi, M., Isaacs, R., McIlroy, R.: AC: composable asynchronous IO for native languages. In: Lopes, C.V., Fisher, K. (eds.) OOPSLA. pp. 903–920. ACM (2011)
8. Johnsson, T.: Lambda lifting: transforming programs to recursive equations. In: FPCA. pp. 190–203 (1985)
9. Kerneis, G., Chroboczek, J.: Are events fast? Tech. rep., PPS, Université Paris 7 (Jan 2009)
10. Kerneis, G., Chroboczek, J.: Continuation-passing C, compiling threads to events through continuations. Higher-Order and Symbolic Computation 24, 239–279 (2011)
11. Key, A.: Weave: translated threaded source (with annotations) to fibers with context passing (ca 1995–2000), personal communication (2010)
12. Krohn, M.N., Kohler, E., Kaashoek, M.F.: Events can make sense. In: USENIX Annual Technical Conference. pp. 87–100 (2007)
13. Myers, D.S., Carlisle, J.N., Cowling, J.A., Liskov, B.: MapJAX: data structure abstractions for asynchronous web applications. In: USENIX Annual Technical Conference. pp. 101–114 (2007)
14. Plotkin, G.D.: Call-by-name, call-by-value and the lambda-calculus. Theoretical Computer Science 1(2), 125–159 (1975)
15. Reynolds, J.C.: Definitional interpreters for higher-order programming languages. Proceedings of the ACM National Conference 2(4), 717–740 (Aug 1972)
16. Reynolds, J.C.: The discoveries of continuations. Lisp and Symbolic Computation 6(3-4), 233–248 (1993)
17. Srinivasan, S., Mycroft, A.: Kilim: isolation-typed actors for Java. In: Vitek, J. (ed.) ECOOP. Lecture Notes in Computer Science, vol. 5142, pp. 104–128. Springer (2008)
18. Steele Jr., G.L., Sussman, G.J.: Lambda, the ultimate imperative (Mar 1976)
19. Thielecke, H.: Continuations, functions and jumps. SIGACT News 30(2), 33–42 (1999)
20. van Wijngaarden, A.: Recursive definition of syntax and semantics. In: Formal Language Description Languages for Computer Programming. pp. 13–24. North-Holland Publishing Company, Amsterdam, Netherlands (1966)
Conference Paper
Abstract Cooperative task management can provide program ar - chitects with ease of reasoning about concurrency is - sues This property is often espoused by those who recommend "event - driven" programming over "multi - threaded" programming Those terms conflate several issues In this paper, we clarify the issues, and show how one can get the best of both worlds: reason more simply about concurrency in the way "event - driven" advocates recommend, while preserving the readability and main - tainability of code associated with "multithreaded" pro - gramming We identify the source of confusion about the two pro - gramming styles as a conflation of two concepts: task management and stack management Those two con - cerns define a two - axis space in which "multithreaded" and "event - driven" programming are diagonally oppo - site; there is a third "sweet spot" in the space that com - bines the advantages of both programming styles We point out pitfalls in both alternative forms of stack man - agement, manual and automatic , and we supply tech - niques that mitigate the danger in the automatic case Finally, we exhibit adaptors that enable automatic stack management code and manual stack management code to interoperate in the same code base
Conference Paper
The current approach to developing rich, interactive web applications relies on asynchronous RPCs (Remote Pro- cedure Calls) to fetch new data to be displayed by the client. We argue that for the majority of web appli- cations, this RPC-based model is not the correct ab- straction: it forces programmers to use an awkward continuation-passing style of programming and to ex- pend too much effort manually transferring data. We propose a new programming model, MapJAX, to rem- edy these problems. MapJAX provides the abstraction of data structures shared between the browser and the server, based on the familiar primitives of objects, locks, and threads. MapJAX also provides additional features (parallel for loops and prefetching) that help develop- ers minimize response times in their applications. Map- JAX thus allows developers to focus on what they do best-writing compelling applications-rather than worry- ing about systems issues of data transfer and callback management. We describe the design and implementation of the MapJAX framework and show its use in three prototyp- ical web applications: a mapping application, an email client, and a search-autocomplete application. We evalu- ate the performance of these applications under realistic Internet latency and bandwidth constraints and find that the unoptimized MapJAX versions perform comparably to the standard AJAX versions, while MapJAX perfor- mance optimizations can dramatically improve perfor- mance, by close to a factor of 2 relative to non-MapJAX code in some cases.
Conference Paper
Tame is a new event-based system for managing con- currency in network applications. Code written with Tame abstractions does not suffer from the "stack- ripping" problem associated with other event libraries. Like threaded code, tamed code uses standard control flow, automatically-managed local variables, and modular inter- faces between callers and callees. Tame's implementation consists of C++ libraries and a source-to-source translator; no platform-specific support or compiler modifications are required, and Tame induces little runtime overhead. Expe- rience with Tame in real-world systems, including a pop- ular commercial Web site, suggests it is easy to adopt and deploy.