HAL Id: tel-00751444
https://tel.archives-ouvertes.fr/tel-00751444
Submitted on 13 Nov 2012
Continuation-Passing C: Program Transformations for
Compiling Concurrency in an Imperative Language
Gabriel Kerneis
To cite this version:
Gabriel Kerneis. Continuation-Passing C: Program Transformations for Compiling Concurrency in an
Imperative Language. Programming Languages [cs.PL]. Université Paris-Diderot - Paris VII, 2012.
English. ⟨tel-00751444⟩
Université Paris Diderot – Paris 7
École Doctorale Sciences Mathématiques de Paris Centre
Continuation-Passing C:
Program Transformations for Compiling Concurrency in an Imperative Language
Gabriel Kerneis
Doctoral dissertation in computer science
Defended on Friday 9 November 2012
before a jury composed of:
M. Juliusz Chroboczek, Advisor
M. Olivier Danvy, Reviewer
M. Thomas Ehrhard, President
M. Jean Goubault-Larrecq
M. Alan Mycroft, Reviewer
Université Paris Diderot – Paris 7
Continuation-Passing C:
Program Transformations
for
Compiling Concurrency
in
an Imperative Language
Gabriel Kerneis
Abstract
Most computer programs are concurrent programs, which need to perform several tasks at the same time. Among the many different techniques to implement concurrent programs, the most common are threads and events. Event-driven programs are more lightweight and often faster than their threaded counterparts, but also more difficult to understand and error-prone. Additionally, event-driven programming alone is often not powerful enough; it is then necessary to write hybrid code, which uses both preemptively-scheduled threads and cooperatively-scheduled event handlers, and is even more difficult.

This dissertation shows that concurrent programs written in a unified, threaded style can be translated automatically into efficient, equivalent event-driven programs through a series of proven source-to-source transformations.

Our first contribution is a complete implementation of Continuation-Passing C (CPC), an extension of the C programming language for writing concurrent systems. The CPC programmer manipulates very lightweight threads, choosing whether they should be cooperatively or preemptively scheduled at any given point. The CPC program is then processed by the CPC translator, which produces highly efficient sequentialized event-loop code, and uses native threads to execute the preemptive parts. This approach retains the best of both worlds: the relative convenience of programming with threads, and the low memory usage of event-loop code.

We then prove the correctness of the transformations performed by the CPC translator. We demonstrate in particular that lambda lifting is correct for functions called in tail position in an imperative call-by-value language without extruded variables. We also show that CPS conversion is correct for a subset of C programs, and that every CPC program can be translated into such a CPS-convertible form.

Finally, we validate the design and implementation of CPC by exhibiting our Hekate BitTorrent seeder, and by showing through a number of benchmarks that CPC is as fast as the fastest thread libraries available to us. We also justify the choice of lambda lifting by implementing eCPC, a variant of CPC using environments, and comparing its performance to CPC.
Résumé
Concurrency

Most computer programs are concurrent programs, which must perform several tasks simultaneously. For example, a network server must answer multiple clients at the same time; a video game must handle keyboard input and mouse clicks while simulating the game world and displaying the scenery; and a networked game must perform all of these tasks at once.

There are many different techniques for implementing concurrent programs. A commonly used abstraction is the concept of a thread: each thread encapsulates a given computation and carries it out in isolation. In a threaded program, the concurrent tasks are thus executed by as many independent threads, which communicate through shared memory. The state of each thread is stored in a stack structure, which is, however, not shared.

An alternative to threads is programming in event-driven style. An event-driven program interacts with its environment by reacting to a set of stimuli called events; in a video game, for instance, keystrokes and mouse clicks. At any given time, each event is associated with a piece of code called the handler of that event; a global scheduler, called the event loop, repeatedly waits for an event to occur and invokes the corresponding handler. A given computation is not necessarily encapsulated within a single event handler: carrying out a complex task, requiring for instance both keyboard and mouse input, involves coordinating several handlers by exchanging the appropriate events.

Contrary to threads, event handlers do not have a stack of their own; event-driven programs are therefore more lightweight, and often faster, than their threaded counterparts. Nevertheless, because it requires splitting the flow of control into many small event handlers, event-driven programming is difficult and error-prone. Moreover, it is often not sufficient on its own, in particular to access blocking interfaces or to exploit multi-core processors. It is then necessary to write hybrid code, which uses both preemptively scheduled threads and cooperatively scheduled event handlers, and is even more difficult.

Since event-driven programming is more difficult but more efficient than programming with threads, it is natural to want to automate it, at least partially. On the one hand, many ad-hoc architectures and transformation techniques have been proposed to mix threads and events, mostly for imperative languages such as C, Java and Javascript. On the other hand, a number of abstractions and techniques, developed to implement functional languages, have been studied in detail and applied to the construction of concurrent functional programs: for instance, monads, continuation-passing style (CPS) and CPS conversion, or functional reactive programming have been used in languages such as Haskell, OCaml and Concurrent ML. This dissertation seeks to reconcile these two streams of research, adapting classical transformation techniques from functional programming in order to build a correct and efficient translator from threads to events for an imperative language.

In Chapter 1, we detail how to write concurrent programs using threads, events and continuations. In particular, we present several styles of event-driven programs, as well as the transformations that we use later on: conversion into continuation-passing style (CPS conversion) and lambda lifting.
Continuation-Passing C
Continuation-Passing C (CPC) is an extension of the C language for writing concurrent systems. The CPC programmer manipulates very lightweight threads and may choose, at any point in the program, whether they should be scheduled preemptively or cooperatively. The CPC program is then processed by the CPC compiler, which produces highly efficient sequentialized event-driven code and uses native threads to execute the preemptive parts. This approach offers the best of both worlds: the relative convenience of thread-based programming, and the low memory footprint of event-driven code.

In Chapter 2, we begin by giving an overview of the CPC language through the example of Hekate, a network server which is the largest CPC program we have written. We show in particular the usefulness of a unified thread concept, distinct from the native threads provided by the operating system: the ability to switch effortlessly between cooperative and preemptive scheduling offers the advantages of event-driven code without the complexity of manually managing thread pools to perform blocking operations. We give a detailed description of the CPC language. The core of the language is as simple as possible, with more complex data structures and synchronisation mechanisms built on top of half a dozen elementary primitives. We conclude the chapter by giving a few examples of such data structures, included in the CPC standard library.
A proven compilation technique

The CPC compiler translates CPC programs into equivalent C programs through a series of source-to-source transformations. Each of these passes employs techniques commonly used to compile functional languages: CPS conversion, lambda lifting, environments and the translation of jumps into tail calls. However, we use them in the context of C, a language notoriously hostile to formalisation. The mere fact that C is an imperative language makes half of these techniques undefined in the general case. Moreover, because C makes it possible to take the address of stack-allocated variables, particular care must be taken to guarantee the correctness of a translation from threads, which have a stack of their own, to events, which do not.

In Chapter 3, we present the transformations performed by the CPC compiler. In particular, we justify why the boxing pass is necessary to guarantee the correctness of the translation when the address of stack variables is captured. Since lambda lifting and CPS conversion are not correct in general in an imperative language, we provide correctness proofs for these passes (Chapters 4 and 5). In Chapter 4, we prove the correctness of lambda lifting for functions called in tail position in an imperative call-by-value language, in the absence of extruded variables. In Chapter 5, we prove that CPS conversion is correct in an imperative language without extruded, static or global variables, even though it involves the early evaluation of some function parameters.
Implementation and evaluation

An essential part of our work is the implementation of the CPC compiler. Having a working compiler is extremely useful to experiment, develop intuitions, check assumptions and run benchmarks.

When working on a low-level language such as C, it is tempting to focus on optimising a small number of implementation details in the hope of improving performance. However, a well-known programmers' maxim warns against this inclination,¹ since optimising without first measuring frequently leads to more obscure code without bringing the slightest performance gain. Our implementation of the CPC compiler has been guided by the conviction that lambda lifting and CPS conversion are efficient enough not to require too many micro-optimisations. Our code remains simple and as close to the theoretical transformations as possible, yet achieves performance comparable to the most efficient thread libraries we know of.

In Chapter 6, we compare the efficiency of CPC programs to that of other implementations of concurrency, and show that CPC is as fast as the fastest of them while reducing memory usage by at least an order of magnitude. We carry out several series of measurements. First of all, we measure the efficiency of the concurrency primitives individually. We then compare the throughput of web servers, a typical example of concurrency, to evaluate the impact of the transformations performed by CPC on the performance of a complete program. Finally, we measure the performance of Hekate, both on embedded hardware with limited resources and on an ordinary multi-core computer under a realistic load.

¹ “Premature optimisation is the root of all evil.” [Knu74]
Although efficiency is an essential criterion, it is also important to evaluate the usability of the language. Since our goal is to develop a pleasant programming language out of efficient, proven implementation techniques, we sought feedback from users to evaluate the expressiveness and convenience brought by CPC for writing concurrent programs. Thanks to the CPC compiler's complete support of the C language, it is possible to write large programs using existing C libraries. Writing Hekate, with two master's students who had never used CPC before, was an opportunity to discover programming idioms associated with the lightweight and deterministic nature of CPC threads. We use Hekate throughout this dissertation to provide examples of CPC code, but also as a reference to evaluate the impact of the transformations performed by the CPC compiler on the size and structure of the generated code.
Understanding event-driven code

Studying the automatic transformation of threads into events is an opportunity to understand in more detail the structure and behaviour of event-driven programs. How some programmers manage to write such large and complex programs without going mad will probably remain a mystery forever, but this dissertation attempts to shed some light on the question.

The code generated by CPC differs from most hand-written event-driven programs on two points: it contains more event handlers, and the use of lambda lifting implies that the local variables of a handler are copied to the next one instead of being allocated once and for all on the heap. It is therefore interesting to modify the CPC compiler so as to generate code closer to what a human being would write, and to determine whether the necessary modifications correspond to known program transformations. This also makes it possible to compare these various styles, and gives an idea of the operations that programmers carry out in their head when writing event-driven code.

Chapter 7 is the result of joint work with Matthieu Boutier [Bou11]. There, we study how the various event-driven styles presented in Chapter 1 can be generated from a common threaded-style description, using the classical transformations of defunctionalization and environments. In particular, we implement one of these variants, eCPC, which uses environments instead of lambda lifting to store local variables. We measure the performance of CPC and eCPC, and show that lambda lifting is more efficient, but less easy to debug, than environments in most cases.
Contributions

We show in this dissertation that concurrent programs written in a threaded style can be translated automatically, through a series of proven source-to-source transformations, into efficient, equivalent event-driven programs.

Our main contributions are:
• a complete implementation of the CPC language (Chapter 2);
• a compilation technique based on proven program transformations (Chapter 3), in particular:
  – a proof of correctness of lambda lifting for functions called in tail position, in an imperative call-by-value language without extruded variables (Chapter 4),
  – a proof of correctness of CPS conversion for programs in CPS-convertible form in an imperative language without extruded, static and global variables (Chapter 5);
• experimental results evaluating the usability and efficiency of CPC, in particular:
  – Hekate, a BitTorrent server written in CPC (Chapter 2),
  – experimental measurements showing that CPC is as fast as the fastest thread libraries available to us, while allowing at least an order of magnitude more threads to be created (Chapter 6);
• an alternative implementation, eCPC, using environments instead of lambda lifting, making it possible to evaluate the gain brought by the latter (Chapter 7).
Remerciements
You are in a maze of twisty little passages, all alike.
Colossal Cave Adventure
Will Crowther
People are capable of learning like rats in mazes.
But the process is slow and primitive. We can learn
more, and more quickly, by taking conscious control
of the learning process.
Mindstorms
Seymour Papert
I would like to thank all those who helped me out of the maze without turning into a rat.

For his trust all along the way, his unfailing support, his precious advice, his countless anecdotes on the most varied subjects and his friendship, thank you to Juliusz. For the incredible thoroughness of his corrections, down to the smallest details, and his report which summarises my thesis so well, and for having understood CPC in less time than it took me to explain it over a cup of coffee, thank you to Alan Mycroft. For his patient explanations, on the subtleties of continuations as well as on the craft of writing a scientific article, and for his suggestions on this manuscript, thank you to Olivier Danvy. For his attentiveness and his constant efforts to make PPS a laboratory where it is pleasant to work, and for taking on the heavy task of presiding over my jury, thank you to Thomas Ehrhard. Thank you, finally, to Jean Goubault-Larrecq for showing his interest in CPC by agreeing to sit on a jury about it for the second time in two months. The presence of you all honours me.

Thanks to those who contributed to this thesis, sometimes without knowing it, one with a piece of advice, another with an idea. Among these pollinators, I remember in particular: Vincent Padovani, for the typesetting of Lemma 5.4.2 and his long explorations of ticket entailment; Frédéric Boussinot, for his invitation to Sophia-Antipolis; Allan McInnes for having cited my article on Lambda the Ultimate² and Tom Duff for having been the first to comment on it; Andy Key, for his fascinating explanations about Weave; Marco Trudel, for his bibliography on goto elimination; Michael Scott for his relentless fight against cooperative threads; Boris Yakobowski for his bug reports; and all the colleagues at PPS who put up with my ramblings at one time or another during these four years.

Thanks to the teachers who trained me and guided me to the gates of the doctorate. For computer science, I remember Bruno Petazzoni, Olivier Hudry, Irène Charon, Samuel Tardieu, Alexis Polti and Paul Gastin. And if only one more had to be added, for living languages, it would obviously be Marie-Christine Mopinot — she will appreciate at its true worth, I am sure, the effort that the writing of this manuscript represented.

A paragraph all to himself, he truly deserves it, for Michel Parigot. He reminds us by his exemplary action that one does not fight in the hope of success! No! no, it is far more beautiful when it is useless. More than useful, on the contrary, essential and irreplaceable are the fights of the one who, not knowing it was impossible, did it.

A thesis would be quite dull without fellow doctoral students. Mine were of the finest vintage: Mehdi for his unfailing affection, Matthieu for taking up the torch without a moment's hesitation, Stéphane "the diligent courier" for his logistical support, Thibaut for the war of the frogs, Antoine who wrote up so quickly, Jonas who will finish soon, Mathias and Marie-Aude who finished too early, Grégoire for his support from the very first moments, Christine for what would we do without her?, Fabien for the coqs, Florian, Julien and Claire even though they are from LIAFA, Pierre, Flavien, Guillaume, Shahin and Jakub. Not forgetting Pejman, the undisputed author of the best CPC program in the world (Hekate), and of one of the worst Java programs (BabelDraw). A final thank you to close this long list (and I am bound to forget someone) to Odile: if doctoral students are the lifeblood of the laboratory, she is unquestionably one of its pillars, indispensable.

Of my friends, I will say nothing but their names. They each know how much they matter to me, and how much I have been able to count on them: this is plain for all to see, and for all eternity. Thank you, then, to Chloé, Fabien, Sophie, Julien, Christine and Joris. Thank you also to Grégoire, Julie, Raphaëlle, Antoine and Gambetta.

Ever present through the hard times, I think finally of my family: of my parents, of the strength of their trust, of Nicole and Henriette to whom I dedicate the defence, of Pierre-Emmanuel and Gildas, and of Georges; of my in-laws, too, from one end of the Boulay-Claverie-Rakotonirina family to the other; and of the others, more distant but faithful, uncle, aunt and cousins. Thank you to Ermione and Fidji.

Thank you to Camille, without whom I would not have finished. Thank you for putting up with my lateness (again tonight!), my absences and my long weekends. Thank you for saying "yes".

Cambridge, 25 October 2012, 8 p.m.

² http://lambda-the-ultimate.org/node/4157
Contents
Abstract 3
Résumé 5
Remerciements 11
Contents 13
Introduction 15
The wild land of concurrency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
The power of continuations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
The hazards of imperative languages . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1 Background 29
1.1 Threaded and event-driven styles . . . . . . . . . . . . . . . . . . . . . . . . 29
1.2 Control flow and data flow in event-driven code . . . . . . . . . . . . . . . 36
1.3 From threads to events through continuations . . . . . . . . . . . . . . . . . 43
2 Programming with CPC 49
2.1 An introduction to CPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.2 The CPC language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.3 The CPC standard library . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3 The CPC compilation technique 67
3.1 Translation passes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2 CPS conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3 Boxing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4 Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.5 Lambda lifting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.6 Optimisation passes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4 Lambda lifting in an imperative language 87
4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2 Optimised reduction rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.3 Equivalence of optimised and naive reduction rules . . . . . . . . . . . . . 95
4.4 Correctness of lambda lifting . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5 CPS conversion in an imperative language 127
5.1 CPS-convertible form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.2 Early evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3 CPS conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.4 Proof of correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6 Experimental results 139
6.1 CPC threads and primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.2 Web-server comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.3 Hekate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7 Alternative event-driven styles 149
This chapter is joint work with Matthieu Boutier [Bou11].
7.1 Generating alternative event-driven styles . . . . . . . . . . . . . . . . . . . 149
7.2 eCPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Conclusions 163
Full contents 169
List of Figures 173
List of Tables 176
List of Definitions 177
List of Lemmas 178
List of Theorems 179
Index 181
Bibliography 185
Introduction
“And what is the use of a book,” thought Alice
“without pictures or conversation?”
Alice’s Adventures in Wonderland
Lewis Carroll
We show in this dissertation that imperative concurrent programs written in threaded style can be translated automatically into efficient, equivalent event-driven programs through a series of proven source-to-source transformations.

In this introduction, we first present the notion of concurrency, and two of its implementations: threads and events. We also introduce Continuation-Passing C (CPC), an extension of the C language for writing concurrent systems, which provides very lightweight threads and compiles them into event-driven code. We then explain how to use another abstraction, continuations, and the associated transformation, conversion into continuation-passing style, to implement such a translation from threads into events. Finally, we show why it is difficult to perform this translation in an imperative language, and give a brief overview of how the CPC translator manages to keep it correct and efficient nonetheless. We conclude with an outline of our main contributions.
The wild land of concurrency
Task management
Writing large programs often involves dividing them into a number of conceptually distinct tasks: each task encapsulates control flow and often a local state, and all tasks access some shared state and communication channels to coordinate with other tasks. (This section owes much to the enlightening article Cooperative Task Management Without Manual Stack Management [Ady+02].) In a sequential program, there is a single task or, more generally, tasks are fully ordered and each of them waits for the previous one to complete before executing. Sequential task management makes it easy to coordinate tasks: there is no risk of conflicting accesses to the shared state, and the programmer can rely on the fact that no two tasks are ever executing simultaneously to reason about the behaviour of the program. However, sequential task management is often too limited.
Most computer programs are concurrent: they perform several tasks at the same time. They might repeat a single task a large number of times. For instance, a web server needs to serve hundreds of clients simultaneously. Or they might use a lot of different tasks, cooperating to build a larger system. For example, a video game needs to handle the graphics and the simulation of the game while reacting to keystrokes and mouse moves from the user. Some complex programs might even need to do both. For instance, a network game server needs to simulate a virtual world while sending and receiving updates from hundreds of players over the Internet.

In a concurrent program, the order in which tasks are executed is determined by a scheduler. One distinguishes two kinds of scheduling: preemptive task management and cooperative task management.
Preemptive task management    Preemptively scheduled tasks can be suspended and resumed by the scheduler at any time. They can be interleaved on a single processor or executed simultaneously on several processors or processor cores. This makes them well suited for highly-parallel workloads and high-performance programs.

One must be very careful when accessing shared state with preemptive tasks: since any other task might be modifying the same piece of data at the same time, concurrent accesses to shared resources need to be protected with synchronisation primitives such as locks, semaphores, or monitors. Should these primitives be forgotten, or misused, race conditions arise: conflicts when accessing shared data that are often hard to debug because they might depend on a specific, non-deterministic scheduling that is difficult to reproduce. Preemptive scheduling makes programs harder to reason about, because global guarantees about shared state must be manually enforced by the programmer with appropriate locking and synchronisation.
Cooperative task management    A cooperatively scheduled task only yields to another one at some explicit points, called cooperation points. Between two cooperation points, the programmer enjoys the ease of reasoning associated with sequential task management: a single task accessing shared state at a given time, with no need to add locks or care about race conditions. It is then only necessary to ensure that invariants about shared state hold when cooperating, rather than in any possible interleaving as is the case with preemptive scheduling. Cooperative schedulers can also be deterministic schedulers, providing guarantees on the order of execution which further helps in synchronising tasks, reproducing bugs and controlling fairness between tasks.

However, because of the exclusive nature of cooperative tasks, they cannot use the power of multiple processors or processor cores. Moreover, since tasks cannot be preempted by the scheduler, a single task performing a long computation or stuck in a blocking operation would prevent every other task from executing: it is impossible to ensure fairness if a task does not cooperate. This means that the programmer must make his program yield regularly. Even in network programs, where each I/O operation is an opportunity to cooperate, this requirement of cooperation sometimes happens to be too limiting.
Hybrid programming    There are at least three cases where cooperative task management is not enough. When a program needs to perform blocking system calls, use libraries with blocking interfaces, or perform long-running computations, preemptive scheduling cannot be avoided. Many cooperative programs are therefore in fact hybrid programs: they use cooperative scheduling most of the time, and fall back to using preemptive tasks for potentially blocking sections of code. This combines the advantages of both approaches, but is often tedious because concurrent systems rarely provide a straightforward way to switch from preemptive to cooperative task management: the programmer usually has to craft his own solution using two different frameworks, one for each kind of scheduling.
Stack management

As we have seen, concurrent programs can often be divided into distinct tasks encapsulating a control flow and a local state. However, these conceptual tasks do not always map directly to the abstractions provided to the programmer by concurrency frameworks. In some models, each task has its own call stack, to store its control flow and local variables; we then speak of automatic stack management. In other models the call stack is shared among all tasks; we speak of manual stack management. In the former case, stacks are independent and they are automatically saved and restored upon context switches by the compiler, the library or the operating system. In the latter, the programmer is responsible for multiplexing several conceptual tasks on a single stack, handling the control flow and the local state of each task by himself. Automatic stack management is of course easier for the programmer, but going manual sometimes cannot be avoided, for efficiency reasons or because the target system does not provide concurrency abstractions with automatic stack management.

Threads and event-driven programming are the two most common examples of automatic and manual stack management, respectively.
Threads    Threads are a widely used abstraction for concurrency. Each thread executes a function, with its own call stack, and all threads share heap-allocated memory. Thanks to automatic stack management, threads are convenient for the programmer: each concurrent task that needs to be implemented is written as a distinct function, and executed in a separate thread. Interaction between threads uses the shared heap, and synchronisation primitives such as locks and condition variables.

Threads are provided either by the operating system, by a user-space library, or directly by the programming language. Generally, threads provided by the operating system are preemptive, and user-space threads cooperative. In both cases, the scheduler has very few hints about the actual stack usage and behaviour of the program. Therefore, it needs to reserve a large, fixed chunk of memory for the call stack upon creation of each thread, and saves and restores it fully at each context switch. This conservative approach potentially wastes a lot of memory, for instance for idle threads with a shallow stack. Moreover, thread creation and context switching are usually two to three orders of magnitude slower than a function call. In implementations exhibiting these limitations, creating large numbers of short-lived threads is not always possible, and in practice the programmer sometimes needs to multiplex several concurrent tasks on a single thread.

However, threads are not necessarily slower than event-driven code, and careful implementations can yield large performance gains. One solution is to use user-space, cooperative libraries, which tend to be faster and more lightweight than native, preemptive threads [Beh+03]. A user-space library does not incur the cost of going through supervisor mode, allowing faster context switches. It can also reduce memory usage by using smaller, or even dynamically-resized, call stacks [Gu+07].
Another approach is to make concurrency constructs part of the programming language, using information from the compiler for optimisations. For example, the compiler can perform static analysis to reduce the amount of memory used, saving only relevant pieces of information on each context switch. This is most effective in the context of cooperative threads because the compiler can also determine the points of cooperation in advance and optimise atomic blocks of code accordingly. It is also possible to add heuristics to choose between different implementations at compile time.
Events    Event-driven programs are built around an event loop which repeatedly gathers external stimuli, or events, and invokes a small atomic function, an event handler, in reaction to each of them. Task management is manual in event-driven style: there is no abstraction to represent a concurrent task with its own control flow and local state, and no automatic transfer of control flow from one handler to the next. If performing a task requires interacting with several events, for instance in a server exchanging a number of network messages with a client, the programmer is responsible for writing each event handler and registering them with the event loop in the correct order as the execution of the task flows. Similarly, synchronisation is achieved manually by registering handlers and firing the appropriate events. Large persistent pieces of data are generally shared between event handlers, while short-lived values used in a handful of handlers are kept local and copied from one handler to the next; again, this replicates manually how local variables and heap-allocated data are used in threads.
Events allow concurrency to be implemented in languages that do not provide threads, and cannot be avoided on systems exposing only an asynchronous, callback-based API; even when threads are available, they provide a lightweight alternative, well suited to highly-concurrent and resource-constrained programs. But events are also an extreme example of the programmer implementing scheduling and context switching entirely by hand, manually sequentializing and optimising the concurrency in his program, then relying on the compiler to compile the resulting sequential code efficiently. This is a very tedious task, and it yields programs that are hard to debug because they lack a call stack: information about control flow and local state must be extracted from custom, heap-allocated data structures. As we shall see, this is also not the most efficient approach (Chapter 7).
Event-driven programs are inherently cooperative, since event handlers are guaranteed to run atomically and control is passed around explicitly by the programmer. As explained above, this enables deterministic scheduling and makes reasoning about a particular piece of code easier. The price to pay for this simpler scheduling is that, just like cooperative threads, event-driven programs do not benefit from parallel architectures with multiple cores or processors, and get frozen if a handler performs a blocking operation. In practice, most event-driven programs are therefore hybrid programs, delegating blocking and long-running computations to native threads, or distributing events across several event loops running in independent threads. This hybrid style makes them even harder to debug.
CPC threads

Since event-driven programming is more difficult but more efficient than threaded programming, it is natural to want to at least partially automate it. On the one hand, many architectures for mixing threads and events, and ad-hoc translation schemes, have been proposed, mostly for wide-spread imperative programming languages such as C, Java or Javascript. On the other hand, a number of abstractions and techniques, developed to implement functional languages, have been studied extensively and applied to build concurrent functional programs: for instance monads, continuation-passing style (CPS) and CPS conversion, or functional reactive programming, in languages such as Haskell, OCaml or Concurrent ML.

This dissertation seeks to bridge the gap between these two streams of research, adapting well-known transformation techniques from functional programming to build an efficient and correct translator from threads to events in an imperative language.

We propose Continuation-Passing C (CPC), an extension of the C language for concurrency designed and implemented with Juliusz Chroboczek. The CPC language offers a unified abstraction, called CPC threads, which are neither native threads nor user-space, library-based threads. Most of the time, CPC threads are scheduled cooperatively, but the programmer has the ability to switch a thread between cooperative and preemptive mode at any time. To the programmer, CPC threads look like extremely lightweight, user-space threads.

When a CPC thread is created, it is attached to the main CPC scheduler, which is cooperative and deterministic. CPC provides a number of primitives that interact with the scheduler, to yield to another thread, sleep, wait for I/O or synchronise on condition variables; the programmer then builds more cooperative functions on top of these primitives. There is a special primitive, cpc_link, that allows the programmer to detach the current CPC thread from the cooperative scheduler, and execute it in a native, preemptive thread instead. Conversely, cpc_link also allows the user to attach a detached thread back to the cooperative scheduler. These unified threads provide the advantages of hybrid programming, without the hassle of manually combining distinct concurrency models. This is not a silver bullet: the programmer must still take care not to block in attached mode, and to use locks properly in detached mode. However, CPC threads eliminate all the boilerplate needed to switch back and forth between both modes, making it straightforward to call blocking functions asynchronously, without writing callbacks manually.
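As a taste of the programming model, here is a hedged sketch: the cps annotation and cpc_sleep appear later in this introduction, while cpc_spawn and cpc_main_loop, assumed here to create a new CPC thread and to run the cooperative scheduler, are only introduced in Chapter 2.

#include <stdio.h>

cps void ticker(char *name, int n) {
    while (n-- > 0) {
        printf("%s\n", name);
        cpc_sleep(1);   /* cooperation point: yields to the scheduler */
    }
}

int main(void) {
    cpc_spawn ticker("tick", 10);   /* two very lightweight CPC threads, */
    cpc_spawn ticker("tock", 10);   /* interleaved at each cpc_sleep     */
    cpc_main_loop();
    return 0;
}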
To compile CPC threads efficiently, we translate CPC programs into event-driven C code, which is then handed over to a regular C compiler. This approach retains the best of both worlds: unified threads for the programmer, easier to reason about and with a simple way to switch scheduling mode, and tiny event handlers at runtime with fast context switches, which allow tens of thousands of threads to be created, even on platforms with constrained resources.
The power of continuations

The CPC translator translates CPC programs in threaded style into equivalent plain C programs in event-driven style. It automates the manual work usually performed by the event-loop programmer: splitting long-running tasks into small atomic event handlers around cooperation points, creating small data structures to pass local data from one handler to the next, and linking handlers correctly to implement the control flow of the original program. The resulting event handlers are scheduled cooperatively by an event loop, or executed by individual native threads when the programmer detaches a CPC thread to preemptive mode.

In manually written event-driven code, the programmer rolls his own data structures to register event handlers and save the pieces of local state that need to be passed to the next handler. Since we want to perform an automatic source-to-source translation, we seek a systematic way to generate event handlers from a threaded control flow. More precisely, we need some data structure to capture the state of a thread, that is to say its current point of execution and local variables, when it reaches a cooperation point.

Continuations are an abstraction that is widely used to represent control flow, and in particular to implement concurrency, in functional programming languages. They can be implemented in a number of ways, including as data structures that fulfill our need to capture the state of CPC threads in an event-driven style. Because the C language does not offer first-class continuations that we could use directly, the CPC translator performs a conversion into continuation-passing style, a transformation that introduces continuations in a program written in direct style.
Continuations and concurrency

Intuitively, the continuation of a fragment of code is an abstraction of the action to perform after its execution. For example, consider the following computation, where add represents the arithmetic operator +:

Figure 1: Small continuation

g(add(f(5), x));

The continuation of f(5) is g(add(◻, x)) because the return value of f, represented by ◻ in the continuation, will be added to the value of the variable x and then passed to g.
A continuation captures the context at some given point in the program: it implicitly records the current instruction, "◻" in the previous example, and the local state, for instance the local variable x to be added to the return value of f. Continuations are therefore perfectly suited to implementing concurrency. If the call to f were a cooperation point in a threaded program for instance, saving its continuation would be enough to resume execution after a context switch.

Continuations are most often used in functional programming languages. Some of them, like Scheme [Abe+98] or Scala [Ode+04], provide first-class continuations with control operators, such as call/cc or shift and reset respectively, that allow a program to capture and resume its own continuations. Cooperative threads and other concurrency constructs are then built on top of these operators [HFW86; DH89; RMO09].
In functional languages that do not provide first-class continuations, continuations are encoded using other features such as first-class functions or monads. These constructs can then be used to implement concurrency libraries: concurrency monads in Haskell [Cla99], or lightweight lwt threads in OCaml [Vou08]. To some extent, these concurrent programs based on continuations are similar to event-driven programs: the programmer writes many small atomic functions, which make the continuations explicit, and composes them using synchronisation functions provided by the library. However, the abstractions provided by functional languages usually make writing such programs less tedious than event-driven code: anonymous lambda-abstractions alleviate the burden of naming every intermediary event handler, and variables can be shared between inner functions and need not be passed explicitly from one handler to the next. With some syntactic sugar to write monads concisely, this yields pleasant and idiomatic code that is in fact more similar to threads than to hard-to-follow, event-driven code written in an imperative language.
Implementing continuations

There are several ways to implement continuations. One approach is to think of continuations as functions, and implement them using closures. The continuation g(add(◻, x)) from Fig. 1 is similar to the function λr.g(add(r, x)) which waits for the return value of f, sums it with x and passes the result to g. Note that the value of x is captured in the closure: local variables are preserved automatically through environments. As explained above, this approach is mainly used in functional languages that do not provide first-class continuations. This technique does not work in the case of CPC because the C language does not feature first-class functions.

Continuations can also be implemented with stacks: capturing a continuation is done by copying the call stack, and resuming it by discarding the current stack and using the saved one instead. This approach is very similar to the way threads are saved and restored in concurrent programming, and it has indeed been shown that threads can implement continuations [KBD98]. It is obviously not usable in the case of CPC since our goal is to compile threads into lightweight event handlers to minimise the amount of memory used per thread.
Finally, continuations can be implemented as a stack of function calls to be performed. Contrary to the native call stack, this stack does not represent the current state of the program, but the rest of the computation as an implicit composition of functions. Consider again Fig. 1:

g(add(f(5), x));

The continuation of f(5) is add(x) ⋅ g. It is a stack of two function calls: first pass the result of f(5) to add(◻, x), then pass the result to g(◻). This is the approach used in CPC. It is similar to event-driven code: we store a list of callbacks to invoke later, along with the values of useful variables, and the return value of each callback is passed to the next. It is straightforward to implement in C, using function pointers for callbacks and structures to store function parameters, and retains only the data relevant to resume the computation.
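A minimal sketch of this representation (simplified and assumed here; the actual CPC runtime structures are described in Chapter 3): a continuation is a stack of frames, each holding a function pointer and its saved parameters, and invoking the continuation pops the top frame and passes the previous return value along. For the continuation add(x) ⋅ g above, the stack would hold a frame for add, with x saved in it, above a frame for g.

#include <stdlib.h>

struct continuation;
typedef void (*callback_t)(int rv, struct continuation *k);

struct frame {
    callback_t fn;    /* function to call next */
    int saved_arg;    /* saved parameter, e.g. x in add(x) . g */
};

struct continuation {
    struct frame *frames;   /* pending calls, the next one on top */
    size_t length;
};

/* Pass the return value rv of the last call to the next pending frame. */
void invoke(struct continuation *k, int rv) {
    if (k->length > 0) {
        struct frame f = k->frames[--k->length];
        f.fn(rv, k);
    }
}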
Conversion into Continuation-Passing Style

Since the C language does not offer first-class continuations, the CPC translator needs to transform CPC programs to introduce continuations. Conversion into Continuation-Passing Style [SW74; Plo75], or CPS conversion for short, is a program transformation technique that makes the flow of control of a program explicit and provides continuations for it.

CPS conversion consists in replacing every function f in a program with a function f⋆ taking an extra argument, its continuation. Where f would return with value v, f⋆ invokes its continuation with the argument v. Remember the computation that we considered in Fig. 1:

g(add(f(5), x));

After CPS conversion, the function f becomes f⋆, which receives its continuation as an additional parameter, λr.g(add(r, x)).

Figure 2: Partial CPS conversion

f⋆(5, λr.g(add(r, x)));

A CPS-converted function therefore never returns, but makes a call to its continuation. Since all of these calls are in tail position, a converted program does not use the native call stack: the information that would normally be in the call stack (the dynamic chain) is encoded within the continuation.
In the context of concurrent programming, having a handle on the continuation makes it easy to implement cooperative threads. Consider for instance the case where f has to yield to some other thread before returning its value. One simply needs to store the continuation λr.g(add(r, x)), which has been received as a parameter, and invoke it later, after having run other threads, to resume the computation.
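A sketch of this mechanism, reusing the continuation structure from the previous section (enqueue is a hypothetical scheduler primitive, not part of CPC as described here): the CPS-converted function does not invoke k itself, but hands it to the scheduler, which resumes it on a later turn of the event loop.

struct continuation;    /* as sketched in the previous section */

/* Hypothetical scheduler primitive: remember (k, rv) and invoke
 * the continuation k with rv on a later turn of the event loop. */
extern void enqueue(struct continuation *k, int rv);

void yielding_f(int n, struct continuation *k) {
    int v = n + 5;      /* stand-in for f's actual computation */
    enqueue(k, v);      /* cooperation point: execution resumes when */
}                       /* the scheduler invokes k with the value v  */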
Figure 2 is an example of partial CPS conversion: only the function f is CPS-converted; g and add are left in direct (non-CPS) style. It is also possible to perform a full CPS conversion, translating every function.

Figure 3: Full CPS conversion

f⋆(5, λr.add⋆(x, r, λs.g⋆(s, k)));

Note that in the case of a full CPS conversion, the last called function, g⋆, expects a continuation too: we need to introduce a variable k to represent the top-level continuation, the context in which this fragment of code is executed.
We can finally rewrite Fig. 3 without lambda-terms, using the more compact implementation of continuations that we introduced in the previous section.

f⋆(5, add⋆(x) ⋅ g⋆ ⋅ k);

This last example is fairly close to the CPS conversion actually performed by the CPC translator. The main difference is that CPC performs a partial translation: because a call to a CPS-converted function (or CPS call, see Section 2.2.1) is slower than a native call, we only translate those functions, called CPS functions, that are annotated as cooperative by the programmer with the cps keyword. Hence, in our example, the function add would probably be kept in direct style, while f and g would be annotated with cps.
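In CPC source code, this partial conversion is driven entirely by the annotations; a sketch of how the running example might be written (the cps keyword is detailed in Section 2.2.1):

int add(int a, int b);   /* direct style: an ordinary, fast C call */
cps int f(int n);        /* cooperative: CPS-converted by the      */
cps void g(int v);       /* CPC translator                         */

cps void run(int x) {
    g(add(f(5), x));     /* only f and g go through continuations */
}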
The hazards of imperative languages

CPS conversion, CPS-convertible form and splitting

The CPC translator is structured as a series of source-to-source passes, transforming CPC programs into plain C event-driven code. Although it might be possible to directly define a CPS conversion for the whole of the C language, we found it too difficult in practice. In particular, in the presence of loops and goto statements, the continuations are not as obvious as in the example shown above. In Fig. 1, the control flow merely consists in nested function calls, which makes it easy to express the continuation of g in terms of "functions to be called later". Therefore, the CPC translator performs several preliminary passes to bring the code into CPS-convertible form, a form suitable for CPS conversion, before the actual CPS-conversion step.

CPS-convertible form is similar to the example shown previously: each call to a function that must be CPS-converted is guaranteed to be followed by a tail call to another CPS-converted function.
Figure 4: CPS calls in CPS-convertible form

f(); return g(a1, ..., an);

If the top-level continuation is k, the continuation of f is g(a1, ..., an) ⋅ k and the CPS conversion is straightforward.

To translate a CPC program into CPS-convertible form, the CPC translator replaces the direct-style chunk of code following each CPS call by a call to a CPS function which encapsulates this chunk. We call this pass splitting because it splits each original CPS function into many, mutually recursive CPS functions in CPS-convertible form. These functions are similar to event handlers: they are small atomic chunks of code that end with a cooperative action, the continuation (callback) of which is explicit.

These functions encapsulating chunks of code, introduced during the splitting pass, are inner functions, defined within the original, split CPS function. They do not have their own local variables, sharing them instead with the enclosing function: this yields free variables, defined in the enclosing function but unbound in the inner ones. For example, in the following code, the local variable i, defined in f, is a free variable in the inner functions f1 and f2.
Figure 5: Split CPS function

cps void f(int i) {
    cps void f1() {
        i++;
        cpc_sleep(1);
        f2();
    }
    cps void f2() {
        if (i > 10) return;
        f1();
    }
    f1();
}
However, inner functions and free variables are not allowed in the C language; they only exist as intermediary steps in the transformations performed by the CPC translator. Another pass is needed after splitting, to eliminate these free variables and retrieve plain C in CPS-convertible form.

Lambda lifting and environments

There are two common solutions used in functional languages to eliminate free variables: lambda lifting and environments (or boxing). Lambda lifting binds free variables in inner functions by adding them as parameters of these functions; the body of the functions is not modified, except for adding the free variables as arguments at every call of an inner function. As a result, each inner function gets its own local copy of the free variables, and passes it to the next function. For example, here is Fig. 5 after lambda lifting, with the lifted variable i added as a parameter to f1 and f2.
cps void f1(int i) {
    i++;
    cpc_sleep(1);
    f2(i);
}
cps void f2(int i) {
    if (i > 10) return;
    f1(i);
}
cps void f(int i) {
    f1(i);
}

A copy of the variable i is made on every call to f1 and f2.
On the other hand, with environments, free variables are boxed in a chunk of heap-allocated memory and shared between inner functions; the body of the functions is modified so that every access to a shared variable goes through the indirection of the environment. Consider once again Fig. 5, using an environment e containing a boxed version of i.
typedef struct env { int i; } env;

cps void f1(env *e) {
    (e->i)++;
    cpc_sleep(1);
    f2(e);
}
cps void f2(env *e) {
    if (e->i > 10) { free(e); return; }
    f1(e);
}
cps void f(int i) {
    env *e = malloc(sizeof(env));
    e->i = i;
    f1(e);
}
The environment e is allocated and initialised in f, freed before returning in f2, and every access to i is replaced by e->i.
The CPC translator uses lambda lifting. We have chosen this technique because a compilation strategy based on environments would most certainly have resulted in a significant overhead. We mentioned earlier that boxing is commonly used to compile functional programs; but the overhead of allocating and freeing memory for environments, as well as of indirect memory accesses, is reasonable in a language such as Scheme because most variables are never mutated, and can therefore be kept unboxed. On the other hand, in C, where mutated variables are the rule and const variables the exception, almost every variable would have to be boxed, hindering compiler optimisations by the use of heap-allocated variables instead of local variables. We have confirmed this intuition with benchmarks showing that, even with a careful implementation of boxing, lambda lifting is faster than environments in most cases (Chapter 7).
Duplicating mutable and extruded variables

Mutable variables    There is a correctness issue when using lambda lifting in an imperative language with mutable variables. Because lambda lifting copies variables, it can lead to incorrect results if the original variable is used after the copy has been modified. For example, the following program cannot be lambda-lifted correctly.

cps void f(int rc) {
    cps void set() { rc = 0; return; }
    cps void done() {
        printf("rc = %d\n", rc);
        return;
    }
    set(); done(); return;
}
This function sets the variable rc to 0, regardless of its initial value, then prints it and returns. The variable rc is free in the functions set and done. We lambda-lift it, and rename it in each function for more clarity.
cps void f(int rc1) {
    set(rc1); done(rc1); return;
}
cps void set(int rc2) {
    rc2 = 0;
    return;
}
cps void done(int rc3) {
    printf("rc = %d\n", rc3);
    return;
}
The lambda-lifted version modifies the copied variable rc2, but then copies rc1 again into rc3 and prints this copy. Therefore, it actually prints the initial value of rc1, which might be different from 0.
In order to ensure the correctness of lambda liing, one needs either to replicate the
modications to every copy of the variable, or to enforce that a variable is never modied
aer it has been copied by lambda liing. e former idea is hardly usable in practice because
it requires to track every lied variable: this would probably involve indirections, and would
be at least as costly as using environments.
As we shall see in Chapter 4, we use the latter solution: even in the presence of mutated variables, the CPC compilation technique ensures the correctness of lambda lifting by enforcing that unboxed lifted variables are never modified after having been copied by lambda lifting. As we explained before, keeping a straightforward lambda-lifting pass without boxing is essential for the efficiency of CPC programs.
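To make the tail-position requirement concrete, here is a sketch of ours (not the translator's literal output) of the same example after splitting: the call to done has become part of set's continuation, so each lifted function hands its up-to-date copy of the variable to the next one and never returns to a context holding a stale copy.

cps void set(int rc2) {
    rc2 = 0;
    done(rc2);    /* tail call: the up-to-date copy travels along */
}
cps void done(int rc3) {
    printf("rc = %d\n", rc3);
}
cps void f(int rc1) {
    set(rc1);     /* tail call: f never reads rc1 afterwards */
}

Here done prints set's modified copy rather than a stale rc1; since every call is in tail position, no out-of-date copy is ever read, which is exactly the shape produced by the splitting pass.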
Extruded variables
The situation gets even worse in the presence of extruded variables: variables whose address has been retained in a pointer through the “address of” operator &. These variables can be modified from any function that has access to the pointer, and they cannot be copied, since copying would invalidate the address stored in the pointer.
This is an issue for lambda lifting, which copies variables, but also for CPS conversion. Consider the following example, where the function f can modify the extruded variable x via the pointer p.
cps int set(int *p) {
    *p = 1;
    return 2;
}
cps int f() {
    int x = 0, r;
    r = set(&x); return add(r, x);
}
The function f sets x to 1, then adds it to the value 2 returned by set, returning 3. If k is the current continuation, the continuation of the call to set(&x) is add(x) ⋅ k, which yields the following code after CPS conversion.
cps void set⋆(int *p, cont k) {
    *p = 1;
    k(2);
}
cps void f⋆(cont k) {
    int x = 0;
    set⋆(&x, add⋆(x) ⋅ k);
}
The variable x, whose original value is 0, is copied into the continuation add(x) ⋅ k when that continuation is created. The variable x is later modified by set, but the copy in the continuation is not updated, and the code returns 2 (= 0 + 2) instead of 3.
Encapsulating extruded variables in environments solves the problem: instead of the variables themselves, a pointer to the environment is copied into the continuation. The function set then modifies the boxed variable, and the function add accesses the updated version through the environment. Boxing also solves the problem of pointers to extruded variables becoming invalid when the variable is copied: since variables are boxed only once, their addresses do not change and can be used reliably by the programmer.
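The following sketch of ours shows the boxed variant in the same notation (assuming an environment type env with the single field x; this is an illustration, not the translator's literal output):

typedef struct env { int x; } env;

cps void set⋆(int *p, cont k) {
    *p = 1;
    k(2);
}
cps void f⋆(cont k) {
    env *e = malloc(sizeof(env));   /* x is boxed once, on the heap */
    e->x = 0;
    /* the continuation captures the pointer e, not a copy of x */
    set⋆(&e->x, add⋆(e) ⋅ k);
}

When add⋆ eventually runs, it reads x through e and sees the value 1 written by set⋆, so the program returns 3 as expected.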
The magic of CPC
To preserve its correctness in the hostile context of the C language, surrounded by the traps of mutable and extruded variables, the CPC translator could cowardly use systematic boxing to shield all local variables in environments. In practice, however, such a conservative approach would not be acceptable: C programs rely heavily on mutating local stack variables, and allocating all of them on the heap would add a significant overhead. We follow a bolder, less obvious approach.
The CPC translator keeps most variables unboxed, yet uses lambda lifting and CPS conversion nonetheless. In fact, only extruded variables are boxed, which amounts in practice to less than 5 % of the lifted variables in Hekate, our largest CPC program. Since copying variables changes their address, which in turn breaks pointers to them, we cannot avoid boxing extruded variables anyway. With this lower bound on the amount of boxing, the CPC translator manages to use lambda lifting and CPS conversion with as little boxing as possible, performing correct transformations with a limited overhead.
This result is made possible by the fact that the CPC translator does not operate on arbitrary programs. We have shown that

    in an imperative call-by-value language
    without extruded, static, and global variables,
    CPS conversion and lambda lifting are correct
    for programs in CPS-convertible form
    obtained by splitting.
More precisely, we will show the following two results:
• Lambda lifting is correct when lifted functions are called in tail position (Theorem 4.1.9). Intuitively, when lifted functions are called in tail position, they never return. Hence, modifying copies of variables is not a problem, since the original, out-of-date variables are not reachable anymore. As it turns out, the lifted functions in CPC are the inner functions introduced by the splitting pass, which are always called in tail position.
• CPS conversion is correct for programs in CPS-convertible form (Theorem 5.4.1). Intuitively, storing copies of variables in continuations is an issue when another function later in the call chain modifies the original variable. But only local variables are copied into continuations, and a function cannot modify the local variables of other functions, except through extruded variables, which are boxed to avoid this problem.
Contributions
The main contributions of this dissertation are:
• a complete implementation of the CPC language (Chapter 2);
• a compilation scheme based on proven program transformations (Chapter 3), in particular:
  – a proof of correctness of lambda lifting for functions called in tail position in an imperative call-by-value language without extruded variables (Chapter 4),
  – a proof of correctness of the CPS conversion for programs in CPS-convertible form in an imperative language without extruded, static and global variables (Chapter 5);
• experimental results evaluating the usability and efficiency of CPC, including:
  – Hekate, a BitTorrent network server written with CPC (Chapter 2),
  – benchmarks showing that CPC is as fast as the fastest thread libraries available to us, while allowing an order of magnitude more threads (Chapter 6);
• an alternative implementation, eCPC, which uses environments instead of lambda lifting, in order to compare the overhead of indirect memory accesses and larger allocations against that of repeated copies of local variables (Chapter 7).
Chapter 1
Background
We have seen in the introduction (page 15) that threads and events are two common techniques to implement concurrent programs. We review them in more detail in Section 1.1. We study in particular how to implement a simple program in both styles, what each style implies in terms of code readability and memory footprint, and when each style might be more suitable.

As it turns out, “events” is actually a generic term for a wide range of manual techniques for concurrency. In Section 1.2, we compare several event-driven styles from real-world programs, and analyse how the programmer encodes the flow of control and the data flow manually in each of them.

Because threads are more convenient for writing concurrent programs, but sometimes not available or not efficient enough, an idea to keep the best of both worlds is to translate threads into events automatically. In Section 1.3, we review existing techniques to perform this transformation, and previous work on bridging the gap between threads and events.
1.1 Threaded and event-driven styles
1.1.1 Threads
An example of threaded style
Consider the following OCaml program that counts one sheep per second.
Figure 1.1: Sequential count
let rec count n animal =
  print_int n; print_endline animal;
  sleep 1.;
  count (n+1) animal

(* start counting sheep *)
count 1 " sheep"
The function count is an infinite counter that displays the number of animals reached so far, sleeps for one second,¹ then calls itself recursively to count the next animal. This is a purely sequential function, which no longer works if we want to count several animals concurrently: since it never returns, it is impossible to start another counter after calling count 1 " sheep".

¹ Hence this program sleeps to count sheep, instead of counting sheep to fall asleep.
To introduce concurrency in this program, one straightforward solution is to use threads. In OCaml, we only need to prefix the call to count with Thread.create. For instance, the following code counts fish and sheep in two separate threads of execution, starting the second thread half a second after the first one.
Figure 1.2: Threaded count

(* start counting fish and sheep concurrently *)
Thread.create (count 1) " fish"
sleep 0.5
Thread.create (count 1) " sheep"
The output looks as follows:

1 fish
1 sheep
2 fish
2 sheep
...
where one line is issued every half second.
In threaded programs, each task is executed by a separate thread with its own control flow and local variables, independent of other threads. A scheduler is responsible for executing each thread in turn, saving the state of the current thread and switching to the next one at points called context switches. In Fig. 1.2, the scheduler executes the thread counting fish; after some time, it saves the current point of execution and the values of the local variables n and animal, switches to the thread counting sheep, restores its local variables, and continues its execution where it left off at the previous context switch. Provided the scheduler performs these context switches fairly and often enough, the output looks like a witness of two tasks executing simultaneously.
Implementing threads
Threads in a given program share every resource (memory space, file descriptors, and so on) except their call stack and the CPU registers. Each thread has its own call stack that contains the activation records of the functions it is currently executing; each activation record holds the information related to a single function execution, and records are stacked one above the other as functions call each other [Aho+88, p. 398]. The call stack captures most of the state of a thread's computation: the activation record of a function call stores, for instance, the local variables and function parameters for this call. Some of the state of the computation is also contained in the CPU registers. Most importantly, the stack pointer and program counter capture respectively the current thread and its current instruction. They need to be saved too, so that the thread can be resumed in the exact same state later on.

When the scheduler switches from one thread to another, it saves the registers and scheduling information in a data structure called a process control block² (PCB) [Dei90, p. 57]; the call stack need not be saved in most cases since it is already located in memory. The scheduler then decides which thread to run next, grabs its PCB, and copies its content back into the registers. Since the registers contain in particular a pointer to the thread's call stack, the thread is then ready to resume its execution until the next context switch.

² The terminology is not uniform across operating systems. In Linux, the PCB is called the process descriptor [BC00, Chapter 3] and is stored in a structure called task_struct.
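As an illustration, here is a highly simplified sketch of ours (hypothetical, not taken from any real kernel) of the data a PCB holds and of the role of a context switch; save_registers and restore_registers stand for the architecture-specific assembly that a real implementation requires:

typedef struct pcb {
    void *stack_pointer;           /* top of this thread's call stack   */
    void *program_counter;         /* next instruction to execute       */
    unsigned long registers[16];   /* general-purpose registers         */
    int state;                     /* runnable, blocked, ...            */
    struct pcb *next;              /* link in the scheduler's run queue */
} pcb;

void context_switch(pcb *current, pcb *next) {
    save_registers(current);    /* store the CPU state of the thread   */
    restore_registers(next);    /* load the saved state of the next    */
                                /* one; execution resumes at its saved */
                                /* program counter, on its own stack   */
}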
Threads are implemented and scheduled either by the operating system (OS) or by a user-space library. For user-space threads in interpreted languages, or in languages using a virtual machine, performing a context switch is sometimes as simple and efficient as swapping two pointers. Native threads, scheduled by the OS, are usually more heavyweight, because each context switch involves a switch to supervisor mode.
When threads are not suitable
There are a number of reasons why a programmer might not want, or be able, to use threads to write a concurrent program: memory overhead, efficiency, lack of language or platform support.

The most common reason for avoiding threads is that one cannot afford the memory overhead they entail. Since each thread reserves space for its own stack, it uses a fixed amount of memory. Because of this historical implementation choice, threads are not the ideal abstraction for systems with many idle threads, which do not perform any useful work but keep wasting memory. In embedded systems with limited resources, this quickly becomes an unacceptable overhead as the number of threads increases. This is also an issue for highly concurrent programs that wish to use many threads: for instance, a web server that accepts ten thousand clients and uses three threads per client would need around 4 GB of memory, for stack space alone, using the NPTL Posix threads library on Linux. Previous authors have proposed implementation techniques to reduce this memory overhead, such as InterLISP's spaghetti stacks [BW73], GCC's split stacks [Tay11], or linked-stack [Beh+03] and shared-stack threads [Gu+07]. However, in most real-world cases, programmers use events instead of threads.
Threads are also sometimes avoided for efficiency reasons, because they might interact badly with CPU caches. Because threads share a common memory space, two threads writing to distinct but nearby variables can hit the same cache line, an issue known as false sharing: there is no real conflict between the writes performed by the two threads, but when the variables happen to lie in the same cache line, the CPU treats them as a single block. On computers with multiple processors or processor cores, this generates cache-coherency traffic between the processors, and this traffic has a significant impact on performance. The sketch below illustrates the situation.
To avoid false sharing, the programmer might use processes instead of threads, sharing and synchronising explicitly the necessary memory chunks; dthreads is an implementation of threads on top of processes that automates this idea [LCB11]. Another alternative is to use a distributed event-driven architecture such as AMPED [PDZ99] or SEDA [WCB01]: the program is split into several processes, each of which executes an independent event loop, and communication and synchronisation are performed by means of events exchanged through queues between the various processes.
Beyond space and time efficiency, there are other, more fundamental reasons why a programmer might not be able to use threads. Some programming languages do not provide threads as a concurrency primitive. This is the case, for instance, of Javascript [Ecm09]: it would be impossible to write the “counting sheep” example (Fig. 1.2) in Javascript because there is no such thing as a Thread.create function in this language. (In fact, even the sequential example shown in Fig. 1.1 could not be written, because there is no sleep function in Javascript.) To help ensure the reactivity of programs, the designers of the language have decided to exclude synchronous functions, which might block, and to provide only asynchronous alternatives, to be used in event-driven style. We shall see in the next section how to use such functions to write an event-driven equivalent of Fig. 1.2.
Asynchronous interfaces can also be imposed by the underlying OS, for example as a way to perform non-blocking Input/Output operations (I/O). Sometimes the OS does not provide threads at all, and the sole API for I/O is asynchronous. This is common in particular in embedded systems, where limited resources preclude the use of threads. For example, some adapters for RAID hard drives developed by IBM around the year 2000 used an architecture called Independent Packet Network (IPN). Each IPN node ran a small firmware, the IPN kernel, whose system calls were all asynchronous, from allocating and freeing memory to reading and writing network packets. This forced the device drivers developed above the IPN kernel to be implemented in purely event-driven style [Key10]. It also happens that asynchronous I/O is an optional complement to threads, designed to use multiple processors and improve performance; using it then yields a hybrid style mixing threads and events. This is the case on Windows, where I/O Completion Ports (IOCP [RN11, Chapter 10]) are the recommended way to perform efficient I/O, in combination with threads.
1.1.2 Events
A common alternative to threads for writing concurrent programs is the use of events. In threaded style, each task is contained in a single execution unit, a thread, that is suspended and resumed by a scheduler. In event-driven programming, each task is split into several small functions, called event handlers or callbacks, that are scheduled by an event loop. The execution of these event handlers is triggered by certain events, like the expiration of a timeout, the availability of data for I/O, or a client connecting to a network server. The event loop repeatedly collects new events, compares them to a set of registered event listeners, and dispatches them to the relevant event handlers. The event handlers then execute atomically: the event loop starts new event handlers but, contrary to preemptive thread schedulers, it can never suspend or interrupt them. Each event handler is responsible for registering its own event listeners with the event loop to carry on its task.
An example of event-driven style
Consider the example of counting fish and sheep, rewritten in event-driven style (Fig. 1.3).
let startEventLoop : unit -> unit = fun () -> (* ... *)
let runAfter : float -> (unit -> unit) -> unit = fun t f -> (* ... *)

let rec count n animal =
  print_int n; print_endline animal;
  runAfter 1. (fun () -> count (n+1) animal)  (* 2 *)

runAfter 0. (fun () -> count 1 " fish")   (* 3 *)
runAfter 0.5 (fun () -> count 1 " sheep") (* 4 *)

startEventLoop ()                         (* 1 *)

Figure 1.3: Event-driven count
The first change with respect to the threaded version is the call to a function startEventLoop to launch the program (1): contrary to threads, which are part of the OCaml language and have an implicit scheduler embedded in the OCaml runtime, there is no event loop provided by the language to schedule event handlers. It needs to be written by the programmer (a sample implementation is detailed in Fig. 1.4) and invoked explicitly. Note that this is not always the case, and depends in fact on the concurrency model offered by the language: Javascript, for instance, provides no threads but offers a runAfter function and has an implicit event loop associated with every program.
The second change, more fundamentally tied to the event-driven model, is the introduction of the function runAfter. The purpose of runAfter is to register an event handler function f with the event loop to handle a timeout event: calling runAfter t f schedules the execution of the function f after t seconds. Hence, instead of sleeping for one second and then calling itself recursively as it did in threaded style, the function count registers a callback with the main event loop to execute its next step one second later (2). Similarly, the tasks counting fish and sheep are scheduled to start with a half-second interval (3 and 4).
Implementing an event loop
Figure 1.4 shows a naive implementation of the functions runAfter and startEventLoop used in Fig. 1.3.
We need to keep track of the current time (1) and of the list of timeouts (2). The latter is represented as a list of pairs: the first value is the expiration time of the timeout, in seconds since the Epoch, and the second one is the handler to invoke when the timeout triggers. The function runAfter computes the expiration time (3) and adds the pair to the list of timeouts (4). The function startEventLoop finds the next timeout (5), sleeps for the time remaining until it expires (6), then looks for expired timeouts (7), removes them from the list (8) and executes them (9). It loops until the list of timeouts is empty (10).
Note that this implementation is not efficient for a large number of timeouts, because it does not even keep the list of timeouts sorted, and hence needs to traverse the whole list every time it looks for the minimal timeout tmin. More efficient data structures, like double-ended queues or heaps, are used to implement timeout queues in realistic event-driven programs; a sketch of such a queue follows Fig. 1.4. This implementation is not complete either: a full-fledged event loop would offer functions to listen for more kinds of events and to stop and restart event listeners, and it would sometimes use a local rather than a global state variable, to enable several event loops to run independently.
type loop_state = {
  mutable current_time : float;                       (* 1 *)
  mutable timeouts : (float * (unit -> unit)) list;   (* 2 *)
}

let state = { timeouts = []; current_time = Unix.gettimeofday () }

let runAfter : float -> (unit -> unit) -> unit = fun t f ->
  let t' = state.current_time +. t in                 (* 3 *)
  state.timeouts <- (t', f) :: state.timeouts         (* 4 *)

let rec startEventLoop : unit -> unit = fun () ->
  match state.timeouts with
  | [] -> ()
  | l ->
    let tmin =                                        (* 5 *)
      List.fold_left (fun m (t, _) -> min m t) infinity l in
    let timeout = max (tmin -. Unix.gettimeofday ()) 0. in
    ignore (Unix.select [] [] [] timeout);            (* 6 *)
    let current = Unix.gettimeofday () in
    let (now, later) = List.partition
      (fun (t, _) -> t <= current) state.timeouts in  (* 7 *)
    state.current_time <- current;
    state.timeouts <- later;                          (* 8 *)
    List.iter (fun (_, f) -> f ()) now;               (* 9 *)
    startEventLoop ()                                 (* 10 *)

Figure 1.4: Implementation of an event loop
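Realistic event loops avoid this traversal by keeping timeouts ordered. As a minimal, hypothetical sketch (ours, in C rather than OCaml, and not taken from the dissertation), a timeout queue can be kept sorted by expiration date at insertion time, so that the next timeout to fire is always at the head:

#include <stddef.h>

struct timeout {
    double expires;              /* absolute expiration time, in seconds */
    void (*handler)(void *);     /* event handler to invoke              */
    void *data;                  /* argument passed to the handler       */
    struct timeout *next;
};

/* Sorted insertion: O(n) per insertion, O(1) to find the next
   expiration; a binary heap would make insertion O(log n). */
void timeout_insert(struct timeout **queue, struct timeout *t) {
    while (*queue != NULL && (*queue)->expires <= t->expires)
        queue = &(*queue)->next;
    t->next = *queue;
    *queue = t;
}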
Memory footprint
We mentioned that threads must sometimes be avoided because of the space wasted by their fixed-size stacks (page 31). To support this point, we compare the memory footprints of the threaded and event-driven implementations. We start an increasing number of tasks counting animals and measure the maximum resident size of the process, on a system with an x86-64 processor and 4 GB of RAM. Figure 1.5 shows the advantage of events over threads when tens of thousands of tasks are required, or on memory-constrained devices. Each call to runAfter allocates around 330 bytes to store the closure fun () -> count (n+1) animal and the list of timeouts; in comparison, each thread allocates a stack of 8 MB. However, most of these allocations are virtual memory and, in practice, each thread uses 34 kB of physical memory. As a result, starting 10 000 timeouts uses 3 MB of physical memory, whereas creating the same number of threads uses 100 times as much memory, and eats up 82 GB of virtual memory. Scheduling 100 000 timeouts uses only 36 MB of memory; creating that many threads is not even possible on our test system because it exceeds system limits.
The downsides of events
One major downside of event-driven style is that it makes the control flow much harder to follow. Because event handlers are executed atomically, they must never block: if an event handler calls a blocking function (for instance, if it sleeps for one second), it blocks the whole program.
Figure 1.5: Memory footprint of threads and events (maximum resident size, in MB, as a function of the number of tasks, up to 100 000; the threaded version fails beyond 25 000 tasks).
Instead, event handlers that need to wait for an operation to complete must use an asynchronous equivalent, with a callback function invoked when the event signalling the completion of the operation triggers. For that reason, in an event-driven program, the control flow of each task is split into many atomic handlers, linked together by callbacks around each blocking point.
This property is not obvious in Fig. 1.3, because that program consists of a single tail-recursive function, count. There is therefore no need to split it: count registers itself as a handler for the timeout event. Consider the following function, with a linear control flow but several calls to the blocking function sleep:
Figure 1.6: Sequential blocking calls

let manySleeps () =
  sle