Introducing the PilGRIM: A Processor for
Executing Lazy Functional Languages
Arjan Boeijink, Philip K.F. Hölzenspies, and Jan Kuper
University of Twente
Enschede, The Netherlands
w.a.boeijink@utwente.nl
Abstract. Processor designs specialized for functional languages received
very little attention in the past 20 years. The potential for exploiting
more parallelism and the developments in hardware technology call for
renewed investigation of this topic. In this paper, we use ideas from
modern processor architectures and the state of the art in compilation
to guide the design of our processor, the PilGRIM. We define a high-level
instruction set for lazy functional languages and show the processor
architecture that can efficiently execute these instructions.
1 Introduction
The big gap between the functional evaluation model, based on graph reduction,
and the imperative execution model of most processors, has made (efficient)
implementation of functional languages a topic of extensive research. Until about
20 years ago, several projects have been undertaken to solve the implementation
problem by designing processors specifically for executing functional languages.
However, advances in the compilation strategies for conventional hardware and
the rapid developments in clock speed of mainstream processor architectures
made it very hard for language specific hardware to show convincing benefits.
In the last 20 years, the processors in PCs have evolved a lot. The number
of transistors of a single core has grown from hundreds of thousands to hun-
dreds of millions following Moore’s law and the clock speed has risen from a
few MHz to a few GHz [5]. The introduction of deep pipelines, superscalar
execution, and out-of-order execution changed the microarchitecture of
processors completely. Currently, processor designs are limited by power usage
instead of transistor count, which is known as the power wall. Two other
“walls” limiting the single-thread performance of modern processors are the
memory wall (memory latency and bandwidth bottleneck) and the
instruction-level parallelism (ILP) wall (mostly due to the unpredictability
of control flow). These three walls have shifted the
focus in processor architectures from increasing frequencies to building multicore
processors.
Recent work on the Reduceron [9] showed encouraging results demonstrating
what can be gained by using low-level parallelism to design a processor specifi-
cally for a functional language. The emphasis of this low-level parallelism is on
simultaneous data movements, but also includes instruction level parallelism.
Functional languages are well suited for applications that are hard to paral-
lelize, such as complex symbolic manipulations. Single thread performance will
be critical when all easy parallelism has been fully exploited. Laziness and first-
class functions make Haskell very data and control flow intensive [10]. Unlike
research done in previous decades, language-specific hardware modifications are
now bounded by complexity (of design and verification) and not by the number of
transistors.
Performance gains for specialized pure functional architectures might influ-
ence conventional processor architectures, because the trend in mainstream
programming languages is towards more abstraction, first-class functions, and
immutable data structures.
The changes in hardware technology, the resulting modern hardware archi-
tectures, and the positive results of the Reduceron suggest it is time to evaluate
processors for functional languages again.
2 The Pipelined Graph Reduction Instruction Machine
In this paper, we introduce the PilGRIM (the name is an acronym of the section
title). The PilGRIM is a processor with a design that is specialized for execut-
ing lazy functional languages. The architecture is derived from modern general
purpose architectures, with a 64-bit datapath and a standard memory hierarchy:
separate L1 instruction and data caches, an L2 cache, and a DDR memory
interface. The design targets silicon and is intended to be a realistic design for
current hardware technology. The instruction set is designed to exploit benefits
of extensive higher-level compiler optimizations (using the output of GHC). This
is especially important for performance of primitive (arithmetic) operations. The
processor executes a high-level instruction set that is close to a functional core
language and that allows code generation to be simple. The PilGRIM reuses
some of the basic principles behind the Big Word Machine (BWM) [1] and the
Reduceron [8].
We intend to pipeline this processor deep enough to make a clock frequency of
1 GHz a feasible target. The details of the pipeline structure are work in progress.
For space reasons, we will not discuss pipelining in this paper, but many design
choices are made in preparation of a deep pipeline.
The contributions of this paper are:

– A high-level and coarse-grained instruction set for lazy functional
  languages, with a simple compilation scheme from a functional core language.
– A proposal for a hardware architecture that can efficiently execute this
  instruction set, where the design is made with a deep pipeline in mind.
3 Instruction Set and Compilation
Before designing the hardware, we first want to find a suitable evaluation model
and an instruction set, because designing hardware and an instruction set simul-
taneously is too complex. We chose to use GHC as the Haskell compiler frontend,
because its External Core [15] feature is a convenient starting point on which to
base a code generator for our new architecture. The extensive set of high-level
optimizations provided by GHC is a big additional benefit.
While External Core is a much smaller language than Haskell, it is still too
complex and too abstract for efficient execution in hardware. Before compiling to
an instruction set, we transform External Core to a simplified and low-level inter-
mediate language, defined in the next section. The instruction set and assembly
language is derived from Graph Reduction Intermediate Notation (GRIN) [4,3]
(an intermediate language in the form of a first-order monadic functional
language). GRIN has been chosen as our basis because it is a simple sequential
language that has a small set of instructions and is close to a functional
language.
function definition:
    d ::= f x+ = e
        | g = e

toplevel expression:
    e ::= s                       (simple expr.)
        | let b in e              (lazy let expr.)
        | letS b in e             (strict let expr.)
        | case s of {a+}          (case expr.)
        | if c then e else e      (if expr.)
        | fix (λr. f r x̄)         (fixpoint expr.)
        | try f x+ catch x        (catch expr.), where f is saturated
        | throw x                 (throw expr.)

    a ::= C x̄ → e                 (constructor alternative)
    b ::= x = s                   (binding)
    c ::= x ⊗ x                   (primitive comparison)

simple expression:
    s ::= x                       (variable)
        | n                       (integer)
        | C x̄                     (constructor [application])
        | f x̄                     (function [application])
        | g x̄                     (global const. [application])
        | y x+                    (variable application)
        | ⊗ x+                    (primitive operation)
        | πn x                    (proj. of a product type)

Fig. 1. Grammar of the simple core language
3.1 A Simple Functional Core Language
As an intermediate step in the compilation process, we want a language that
is both simpler and more explicit than External Core, with special and
primitive constructs. The simple core language is (like GRIN) structured using
supercombinators, which means that all lambdas need to be transformed to
toplevel functions (lambda lifting). The grammar of the simple core language is given
in Figure 1. Subexpressions and arguments are restricted to plain variables,
achieved by introducing let expressions for complex subexpressions. An explicit
distinction is made between top-level functions and global constants (both have
globally unique names, denoted by f and g). All constructors and primitive op-
erations are fully saturated (all arguments applied). From the advanced type
system in External Core, the types are simplified to the point where only the
distinction between reference and primitive types remains. Strictness is explicit
in this core language, using a strict let, a strict case scrutinee, and strict primi-
tive variables. This core language has primitive (arithmetic) operations and an
if-construct for primitive comparison and branching. Exception handling is sup-
ported by a try/catch construct and a throw expression. Let expressions are not
recursive; for recursive values a fixed point combinator is used instead. Selection
from a product type could be done using a case expression, but is specialized
with a projection expression.
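
To make the grammar concrete, the sketch below transcribes Figure 1 into a
Haskell data type, in the spirit of the series of Haskell models the authors
mention later (Section 4.5). The constructor names and exact argument shapes
are our own assumptions, not definitions from the paper.

    -- A minimal AST for the simple core language of Figure 1 (illustrative).
    type Var = String   -- subexpressions and arguments are plain variables
    type Con = String   -- constructor name

    data Expr
      = Simple SExpr                     -- s
      | Let  Var SExpr Expr              -- lazy let (non-recursive)
      | LetS Var SExpr Expr              -- strict let
      | Case SExpr [(Con, [Var], Expr)]  -- constructor alternatives
      | If CmpOp Var Var Expr Expr       -- primitive comparison and branch
      | FixE Var [Var]                   -- fix (\r -> f r xs), f saturated
      | Try Var [Var] Var                -- try f xs catch h
      | Throw Var

    data SExpr
      = EVar Var            -- variable
      | ELit Int            -- integer
      | ECon Con [Var]      -- saturated constructor application
      | EFun Var [Var]      -- top-level function application
      | EGlob Var [Var]     -- global constant application
      | EApp Var [Var]      -- variable application
      | EPrim PrimOp [Var]  -- saturated primitive operation
      | EProj Int Var       -- projection from a product type

    data CmpOp  = CmpEq | CmpLt | CmpLe   -- representative comparisons
    data PrimOp = OpAdd | OpSub | OpMul   -- representative primitives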
3.2 Evaluation Model and Memory Layout
Almost all modern implementations of lazy functional languages use a variation
of the G-machine, which introduced compiled graph reduction [6]. The evaluation
model and memory layout described in this section are derived largely from a mix
of the spineless tagless G-machine (STG), as used in GHC [12], and GRIN [3]. The
machine model (from GRIN) consists of an environment and a heap. The environ-
ment maps variables to values, where values can be either primitive values or heap
references. The environment is implemented as a stack of call frames. The heap
maps references to nodes, where a node is a data structure consisting of a tag and
zero-or-more arguments (values). Node tags specify the node’s type and contain
additional metadata, in particular a bitmask denoting which of the node’s argu-
ments are primitive values and which are references. This simplifies garbage collec-
tion. Using metadata in tags—as opposed to using info pointers, as in STG—makes
handling tags more involved, but reduces the number of memory accesses.
Table 1. Node types

Type | Represents                 | Contents
c    | Constructor                | Constructor arguments
p    | Partially applied function | Args. and number of missing args.
f    | Fully applied function     | All arguments for the function
Tags distinguish three basic node types (see Table 1). Both c- and p-nodes
are in Weak Head Normal Form (WHNF), while f-nodes are not. Tags for c-nodes
contain a unique number for the data type they belong to and the index of the
constructor therein. The p- and f-tags contain a pointer to the applied function.
The f-nodes are closures that can be evaluated when required. To implement
laziness, f-nodes are overwritten (updated) with the result after evaluation.
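
The node layout can be modelled directly in Haskell; the following sketch is
our reading of this section and Table 1 (a tag carrying metadata, plus
arguments), with illustrative field names.

    -- Illustrative model of heap nodes (Section 3.2, Table 1).
    type Addr    = Int
    type FunName = String

    data Value = Ref Addr | Prim Int   -- heap reference or primitive value

    data Tag
      = ConTag Int Int       -- c-node: data type number, constructor index
      | PapTag FunName Int   -- p-node: applied function, missing-arg count
      | FunTag FunName       -- f-node: fully applied, not yet evaluated

    data Node = Node
      { nodeTag  :: Tag
      , refMask  :: [Bool]   -- which arguments are references; this is the
                             -- metadata that simplifies garbage collection
      , nodeArgs :: [Value]
      }

    -- c- and p-nodes are in WHNF; f-nodes are not.
    inWhnf :: Node -> Bool
    inWhnf n = case nodeTag n of
      FunTag _ -> False
      _        -> True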
A program is a set of functions, each resulting in an evaluated node on the
stack (i.e., a c- or p-node). Every function has a fixed list of arguments and
consists of a sequence of instructions. Next, we introduce the instruction set, on
the level of an assembly language (i.e., with variables as opposed to registers).
3.3 Assembly Language
Similar to GRIN, PilGRIM’s assembly language can be seen as a first-order, un-
typed, strict, and monadic language. The monad, in this case, is abstract and
implicit. That is, the programmer can not access its internal state other than
through the predefined instructions. This internal state is, in fact, the state of the
heap and the stack. Unlike GRIN, the syntax of PilGRIM’s assembly language is
not based on a “built-in bind structure” [3], but rather on Haskell’s do-notation.
This, however, imposes considerably more restrictions, viz. in

    pattern ← instruction; rest

the variables in pattern are bound in rest to the corresponding parts of the result
of instruction. Which pattern is allowed is determined by the instruction. The
grammar of the assembly language (including extensions) is shown in Figure 2.
function implementation:
    fun ::= f a+ = block
          | g = block

basic block:
    block ::= (instr;)* term

instruction:
    instr ::= x ← Store T ā
            | x ← PushCAF g
            | y ← PrimOp ⊗ y y
            | y ← Constant n
            | T ā ← Call call cont
            | x ← Force call cont

terminator instruction:
    term ::= Return T ā
           | Jump call cont
           | Case call cont alt+
           | If (y ⊗ y) block block
           | Throw x

case alternative:
    alt ::= T ā → block

callable expression:
    call ::= (Eval x)
           | (EvalCAF g)
           | (TLF f a+)
           | (Fix f ā)

evaluation continuation:
    cont ::= ()
           | (Apply a+)
           | (Select n)
           | (Catch x)

node tag:
    T ::= C_con | F_fun | P^m_fun

argument or parameter:
    a ::= x    (ref. var.)
        | y    (prim. var.)

Fig. 2. Grammar of PilGRIM’s assembly language
Like the simple core language, PilGRIM programs are structured using super-
combinators (i.e., only top-level functions have formal parameters). A program
is a set of function definitions, where a function is defined by its name, its for-
mal parameters, and a code block. A block can be thought of as a unit, in that
control flow does not enter a block other than at its beginning and once a block
is entered, all instructions in that block will be executed. Only the last instruc-
tion can redirect control flow to another block. Thus, we distinguish between
instructions and terminator instructions. The former can not be the last of a
block, whereas the latter must be. It follows that a block is a (possibly empty)
sequence of instructions, followed by a terminator instruction.
As discussed in Section 3.2, nodes consist of a tag and a list of arguments.
When functions return their result, they do this in the form of a node on the top
of the stack. This is why the pattern to bind variables to the result of a Call
must be in the form of a tag with a list of parameters.
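
As a concrete rendering of Figure 2, the grammar can be transcribed into
Haskell data types, as sketched below. The result binders (the x ←, y ←, and
T ā ← patterns) are elided for brevity, and all names are our own
transcription rather than definitions from the paper.

    -- Figure 2 as Haskell data types (our transcription; binders elided).
    data Block = Block [Instr] Term   -- instructions, then one terminator

    data Instr
      = Store Tag [Arg]         -- x <- Store T as
      | PushCAF GlobalName      -- x <- PushCAF g
      | PrimOp Op Arg Arg       -- y <- PrimOp op y y
      | Constant Int            -- y <- Constant n
      | Call Callable Cont      -- T as <- Call call cont
      | Force Callable Cont     -- x <- Force call cont

    data Term
      = Return Tag [Arg]
      | Jump Callable Cont
      | CaseOf Callable Cont [Alt]
      | IfOp Op Arg Arg Block Block
      | ThrowTo Arg

    data Alt = Alt Tag [Arg] Block   -- tag pattern, parameters, branch

    data Callable
      = Eval Arg
      | EvalCAF GlobalName
      | TLF FunName [Arg]
      | Fix FunName [Arg]

    data Cont = NoCont | Apply [Arg] | Select Int | Catch Arg

    data Tag = CTag ConName | FTag FunName | PTag Int FunName
    data Arg = RefVar String | PrimVar String

    type GlobalName = String
    type FunName    = String
    type ConName    = String
    type Op         = String   -- primitive operation, e.g. "add"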
Basic instructions. We will continue with an informal description of all in-
structions that are essential for a lazy functional language. The appendix con-
tains a more formal description of the semantics of the assembly language in an
operational style.
First, we only consider top-level functions. A top-level function is called by
forming a callable, using TLF with the name of the function to call and all the
arguments to call it with. Given such a callable, Call pushes the return address
and the arguments onto the stack and jumps to the called function’s code block.
In all cases, a Call has a corresponding Return. Return clears the stack down
to (and including) the return address pushed by Call. Next, it pushes its result
node (tag and arguments) onto the stack, before returning control flow to the
return address.
Next, we consider calls to non-top-level functions. As discussed above, f-nodes
contain fully applied functions. Therefore, f-nodes can be considered callable.
On the level of PilGRIM’s assembly language, however, different types of nodes
can not be identified in the program. To this end, Eval takes a heap reference to
any type of node, loads that node from the heap onto the stack, and turns it into
a callable. Calling Eval x thus behaves differently for different types of nodes.
Since c- and p-nodes are already in WHNF, a Call to such a node will implicitly
return immediately (i.e., it simply leaves the node on the stack). f-nodes are
called using the same mechanism as top-level functions.
In Figure 2, Call has another argument, namely a continuation. In PilGRIM’s
assembly language, continuations are transformations of the result of a call, so
they can be seen as part of the return. The empty continuation, (), leaves the
result unchanged. Select n takes the n-th argument from the c-node residing at
the top of the stack and (if that argument is a reference) loads the corresponding
node from the heap onto the stack. Similarly, Apply works on the p-node at the
top of the stack. Given a list of arguments, Apply appends its arguments to
those of the p-node. If this saturates the p-node (i.e., if this reduces the number
of missing arguments to nil), the resulting f-node is automatically called, as if
using Call and Eval.
Instead of returning, another way to transfer control flow from the end of
a block to the beginning of another is by using Case. Case can be understood
as a Call combined with a switch statement, or Call as a Case with only one
alternative. After the callable returns a result and the continuation is applied,
the c-node at the top of the stack is matched against a number of cases. If a case
matches, the arguments of the node are bound to the corresponding parameters
of the case and control flow is transferred to the case’s block. Note that in this
transfer of control flow, the stack is unaltered (i.e., the top-most return address
remains the return address for the next Return).
Finally, a node can be written to the heap by the Store instruction. Since
the heap is garbage collected, the address where a node is stored is not in the
program’s control. Store takes its arguments from the stack, allocates a new
heap node, and then pushes the reference to the new heap node onto the stack.
Extensions. Because some patterns of instructions are very common and, thus
far, the instruction set is very small, a few extra instructions and combined
instructions are added as optimizations.
First, support for primitive values and operations is added. As a load imme-
diate instruction, Constant produces a primitive constant. Primitive constants
can be fed to primitive operations by means of PrimOp. Control flow can be
determined by comparison operations on primitive values in If. The implemen-
tation of arithmetic operations in PilGRIM is such that it can be seen as an
independent coprocessor. By letting the operation be a parameter of PrimOp and
If, the instruction set is easily extended with more primitive operations.
Second, we add support for functions in Constant Applicative Form (CAF).
These are functions without arguments (i.e., global constants). They are stored
in a fixed position on the heap and require special treatment in case of garbage
collection [11]. To this end, PushCAF generates a constant reference to a function
in CAF. Furthermore, EvalCAF is used to make such a function callable. EvalCAF
can be interpreted as a PushCAF, followed by an Eval.
Third, there are some useful optimizations with regards to control flow in-
structions. The common occurrence of tail recursion in lazy functional program-
ming languages calls for a cheaper means than having to Call every recursion.
The Jump instruction is a terminator instruction that redirects control to the
code block of the recursive call, without instantiating a new call frame on the
stack. Another combination with calls is having a Call immediately followed by
a Store. This happens in the case of evaluating references in a strict context.
Force is a Call, followed immediately by a Store of the call’s result.
Finally, for exception handling, we add a terminator instruction Throw and a
continuation Catch. The latter places a reference to an exception handler on the
stack. The former unwinds the stack, down to the first exception handler and
calls that handler with the thrown value as argument.
3.4 Translation of the Core Language to the Instruction Set
The translation from the core language (presented in Section 3.1) to PilGRIM’s
assembly language (Section 3.3) is defined by a four-level scheme. The entry
point of the translation is T (Figure 3). This scheme translates (strict) top-level
expressions of the core language. It is quite straightforward, except maybe for
the translation of function applications. At this point, the distinction must be
made between saturated and unsaturated function application. The notation α(f)
refers to the arity of function f, whereas |x̄| denotes the number of arguments in
the core language expression.

Subexpressions can be translated for lazy evaluation (by means of scheme
V) or strict evaluation (scheme S). Note that some subexpressions are only in
scheme S, because they are strict by definition. The lowest level of the translation
is scheme E. This scheme determines the calling method for every expression (i.e.,
both the callable and the continuation).
toplevel expression translation:
    T⟦C x̄⟧              = Return C_c x̄
    T⟦f x̄⟧              = Jump E⟦f x̄⟧                      if α(f) ≤ |x̄|
                          Return P_f x̄                      if α(f) > |x̄|
    T⟦s⟧                = Jump E⟦s⟧          (other simple expressions)
    T⟦let x = s in e⟧   = x ← V⟦s⟧; T⟦e⟧
    T⟦letS x = s in e⟧  = x ← S⟦s⟧; T⟦e⟧
    T⟦fix (λr. f r x̄)⟧  = Jump (Fix f x̄) ()
    T⟦try f x̄ catch h⟧  = Jump (TLF f x̄) (Catch h)
    T⟦throw x⟧          = Throw x
    T⟦case s of {C ȳ → e}⟧ = C ȳ ← Call E⟦s⟧; T⟦e⟧
    T⟦case s of {ā}⟧    = Case E⟦s⟧ [C ȳ → T⟦e⟧ | (C ȳ → e) ∈ ā]
    T⟦if c then p else q⟧ = If c T⟦p⟧ T⟦q⟧

lazy subexpression translation:
    V⟦g x̄⟧  = PushCAF g                                    if |x̄| = 0
              h ← PushCAF g; Store F_ap h x̄                if |x̄| > 0
    V⟦f x̄⟧  = Store P_f x̄                                  if α(f) > |x̄|
              Store F_f x̄                                  if α(f) = |x̄|
              h ← Store F_f ȳ; Store F_ap h z̄              if α(f) < |x̄|
                where ȳ = x_1,…,x_α(f) and z̄ = x_α(f)+1,…,x_|x̄|
    V⟦y x̄⟧  = Store F_ap y x̄
    V⟦πn x⟧ = Store F_Sel_n x

strict subexpression translation:
    S⟦n⟧    = Constant n
    S⟦⊗ x̄⟧  = PrimOp ⊗ x̄
    S⟦x⟧    = Force E⟦x⟧
    S⟦C x̄⟧  = Store C_c x̄
    S⟦y x̄⟧  = Force E⟦y x̄⟧
    S⟦πn x⟧ = Force E⟦πn x⟧
    S⟦f x̄⟧  = Force E⟦f x̄⟧                                 if α(f) ≤ |x̄|
              Store P^m_f x̄  where m = α(f) − |x̄|          if α(f) > |x̄|

evaluation expression translation:
    E⟦x⟧    = (Eval x) ()
    E⟦πn x⟧ = (Eval x) (Select n)
    E⟦g x̄⟧  = (EvalCAF g) ()                               if |x̄| = 0
              (EvalCAF g) (Apply x̄)                        if |x̄| > 0
    E⟦f x̄⟧  = (TLF f x̄) ()                                 if α(f) = |x̄|
              (TLF f ȳ) (Apply z̄)                          if α(f) < |x̄|
                where ȳ, z̄ as in V⟦f x̄⟧ above
    E⟦y x̄⟧  = (Eval y) (Apply x̄)

Fig. 3. Translation from simple core to PilGRIM’s assembly language
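
As a worked example (our own hand application of the schemes, using a
hypothetical list type with constructors Nil and Cons), consider the map
function in the simple core language:

    map f xs = case xs of
      Nil       -> Nil
      Cons y ys -> let h = f y in let t = map f ys in Cons h t

Scheme T turns the case into a Case over E⟦xs⟧ = (Eval xs) (). In the Cons
branch, scheme V stores an F_ap node for the unknown application f y and an
F_map node for the saturated recursive call, after which the new cell is
returned:

    map f xs =
      Case (Eval xs) ()
        [ C_Nil       → Return C_Nil
        , C_Cons y ys → h ← Store F_ap f y;
                        t ← Store F_map f ys;
                        Return C_Cons h t ]

Both lets are lazy, so the branch itself performs no evaluation; the stored
f-nodes are evaluated only when their results are demanded.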
4 The PilGRIM Architecture
In this section, we describe a simplified variant of the PilGRIM architecture,
which is complete with regards to the instruction set defined in the previous
section. This variant is not pipelined and does not include many hardware
optimizations.
The Structure of the Architecture
Essential to the understanding of the structure of PilGRIM is the partitioning
of the memory components. The main memory element is the heap. The heap
contains whole nodes (i.e., addressing and alignment are both based on the size
and format of nodes, which is discussed in more detail in Section 4.3). Loads
from and stores to the heap are performed in a single step. That is, memory
buses are also node-aligned. For the larger part, the heap consists of external
DDR memory. However, PilGRIM has a small allocation heap (see Section 4.2)
to exploit the memory locality present in typical functional programs.
PilGRIM’s assembly language (Section 3.3) is based on a stack-model. The
stack supports (random access) reading and pushing everything that is required
for the execution of an instruction in a single cycle. The stack contains nodes,
values, and call frames. Call frames always contain a return address and may
contain an update continuation and zero or more application continuations. As
discussed in Section 4.1, the stack is not implemented as a monolithic component,
but split up to make parallel access less costly in hardware.
At the logistic heart of the architecture sits a crossbar, which connects the
stack, the heap, and an ALU (used for primitive operations). The crossbar can
combine parts from different sources in parallel to build a whole node, or any-
thing else that can be stored or pushed on the stack in a single step. PilGRIM’s
control comes from a sequencer that calculates instruction addresses, decodes in-
structions, and controls all other components of the architecture. The sequencer
reads instructions from a dedicated code memory.
4.1 Splitting the Stack
The stack contains many different kinds of values: nodes read from memory,
function arguments, return addresses, update pointers, intermediate primitive
values, and temporary references to stored nodes. Every instruction reads and/or
writes multiple values from/to the stack. All these data movements make the
stack and the attached crossbar a critical central point of the core. The mix
of multi-word nodes, single-word values, and variable-sized groups of arguments
makes it very hard to implement a stack as parallelized as required, without
making it big and slow. A solution to this problem is to split the stack into
multiple special-purpose stacks. The second reason to split up the stack is to
make it possible to execute a complete instruction at once (by having a separate
stack for each aspect of an instruction).
The return/update stack contains the return addresses, update references, and
counters for the number of nodes and continuations that belong to a call frame.
The continuation stack contains arguments to be applied or other simple contin-
uations to be executed, between a return instruction and the actual jump to the
return address. The node stack only contains complete nodes (including their
tag). Values in the topmost few entries of the node stack can be read directly.
Every instruction can read from and pop off any combination of these top en-
tries. Reading from, popping from and pushing onto the node stack can take
place simultaneously. The reference queue contains the references to the most
recently stored nodes. The primitive queue contains the most recently produced
primitive values.
The return stack and continuation stack are simple stacks with only push/pop
functionality. The queues are register-file structures with multiple read ports
that contain the n most recently written values. When writing a new value into
a queue, the oldest value in the queue is lost.
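
Behaviourally, such a queue can be sketched in a few lines of Haskell (our
model, not the authors’ hardware description): a fixed capacity, newest-first
contents, and writes that silently drop the oldest entry.

    -- Illustrative model of a PilGRIM queue (Section 4.1).
    data Queue a = Queue Int [a]   -- capacity, newest-first contents

    emptyQueue :: Int -> Queue a
    emptyQueue n = Queue n []

    -- Writing pushes the new value in front; the oldest value falls off.
    push :: a -> Queue a -> Queue a
    push v (Queue n vs) = Queue n (take n (v : vs))

    -- Index 0 is the most recently produced value, matching the way the
    -- instruction set indexes the queues (Section 4.4).
    readQ :: Int -> Queue a -> a
    readQ i (Queue _ vs) = vs !! i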
A very parallel stack, such as the node stack, requires many read and write
ports in its hardware implementation, which makes it slow and big. By storing
the top of the stack in separate registers, the number of read ports for the rest
of the stack can be reduced, because the top element is accessed most often.
Pushing can be faster, because the pushed data can only go to a single place and
the top of stack register can be placed close to the local heap memory. The other
stacks also have a top of stack register for fast access (not shown in Figure 4). All
stacks are implemented using register files for the top few entries and backed up
by a local memory. Transfers between the stack registers and the stack memory
are handled automatically in the background by the hardware.
[Figure: block diagram of the basic PilGRIM architecture — heap memory with
allocation heap, load and store units, node stack with top-of-stack registers,
continuation stack, update stack, return stack, reference queue, primitive
queue, ALU, crossbar, and control unit with code memory; datapath widths range
from one word, via 2–4 words, to a whole node.]

Fig. 4. The basic PilGRIM architecture
4.2 Allocation Heap
Typically as much as 25% of the executed instructions are stores, requiring a lot
of allocation bandwidth. The percentage of instructions reading from the heap
can be even higher than that. It is crucial that most loads and stores be fast,
which can be achieved by using a small local memory.
We can make allocation more efficient by storing all newly allocated nodes
in a sizable buffer (the allocation heap) of several hundreds of nodes first. The
allocation heap serves, in effect, as a fast, directly mapped, tagless data cache.
We have chosen 512 elements of four words as the size for the allocation heap.
This is 16 kilobytes of memory. The allocation heap exploits the temporal locality
between the allocation of and reading back of the same data. To save bandwidth
to external memory, nodes with a short lifetime are garbage collected while
still in the allocation heap. This is implemented in hardware using a reference
counting mechanism (extending the one-bit reference counters from SKIM [14]).
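
The stated sizes are consistent: 512 elements × 4 words × 8 bytes per word is
16384 bytes, i.e. 16 kilobytes. Assuming the direct mapping works like a
conventional direct-mapped cache on node-aligned addresses (our reading of
“directly mapped” above; the function names are ours), the slot for an address
can be computed as follows.

    -- Size and slot arithmetic for the allocation heap (Section 4.2).
    import Data.Bits ((.&.), shiftR)

    allocSlots, allocHeapBytes :: Int
    allocSlots     = 512
    allocHeapBytes = allocSlots * 4 * 8   -- 16384 bytes = 16 KB

    -- Assumed direct mapping: drop the two address bits that select the
    -- word within a four-word element, keep the low nine bits as slot index.
    slotOf :: Int -> Int
    slotOf addr = (addr `shiftR` 2) .&. (allocSlots - 1)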
4.3 Hardware Component Sizes and Constraints
We have chosen the sizes of hardware components to be small enough to ensure
that the cost in area and latency are reasonable, but also big enough to not limit
performance too much. Some hardware constraints need transformations in the
code generator to work around them.
– The datapath is 64 bits wide and all data in memory and registers is
  organized and addressed in 64-bit words.
– Nodes are limited to eight words (one tag word and seven values). The
  compiler has to transform the code to eliminate all constructors and function
  applications that are too big, as is done in the BWM and the Reduceron.
– The heap is divided into four-word elements, because most of the nodes on
  the heap fit in four words [1,9]. Using wider elements adds a lot to hardware
  cost and only gives a small performance benefit. If a node is wider than four
  words, reading or writing it takes extra cycles. Updating a small f-node
  with a big result node is done using indirection nodes, as in the STG [12].
– Only the top four entries of the node stack can be read. The code generator
  ensures that all required values from the stack are in the top four nodes.
– Both the reference queue and the primitive queue are limited to the last 16
  values produced. Older values need to be pushed onto the node stack.
– The continuation stack is two words wide. This is enough for most function
  applications [7]. For bigger applications, arguments can be pushed onto the
  continuation stack in multiple steps.
4.4 Instruction Set
The assembly language (defined in Section 3.3) and the instruction set differ only
in two important details. First, all variables are replaced by explicit indices to
elements on the stacks or queues. For reference values, copying/destructive reads
are explicitly encoded, to have accurate live reference information for garbage
collection. Second, the structure of programs is transformed to a linear sequence
of instructions, by using jump offsets for the else branches and case alternatives.
Instruction Set Encoding. All instructions can be encoded in 128 bits, although
the majority of instructions fit within 64 bits. The mix of two instruction lengths
is a trade-off between fast (fixed-length) instruction decoding and small code
size (where variable-length instructions are better). Small instructions are all
arithmetic operations, constant generation, if-with-compare, and many other
instructions with only a few operands. Operands are not only stack and queue
indices, but can also be small constant values. The distinction between primitive
and reference values is explicit in the operands. All instructions have a common
header structure, with the opcode and a bitmask to optionally clear the queues
or any of the top stack nodes.
Generating Instructions. The process of generating actual instructions from the
assembly language is straightforward. Every constructor is assigned a unique
number, where the numbers are grouped in such a way that the lowest bits
distinguish between the alternatives within a data type. Values on the node
stack are indexed from the top by a number starting with zero, and values in
the queues are indexed from the most recently produced entry in the queue.
Every instruction that produces a result pushes its result on either the node
stack or one of the queues. Thus the sequence of executed instructions within
a supercombinator determines the translation of variables in the assembly to
stack or queue indices (without any ‘register’ allocation). The last step is linking
together all supercombinators, which assigns an instruction address to every use
of a function name.
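
The constructor-numbering scheme can be illustrated with a small sketch; the
exact bit layout (alternative index in the low bits, data type number above it)
is our assumption of one way to satisfy the description.

    -- Illustrative constructor numbering (Section 4.4): the lowest bits
    -- distinguish the alternatives within a data type.
    import Data.Bits (shiftL, (.&.))

    conNumber :: Int -> Int -> Int -> Int
    conNumber altBits dataTypeId altIndex =
      (dataTypeId `shiftL` altBits) + altIndex

    -- A Case then needs only the low bits to select an alternative.
    altIndexOf :: Int -> Int -> Int
    altIndexOf altBits n = n .&. ((1 `shiftL` altBits) - 1)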
4.5 Hardware Implementation Strategy
The implementation of the PilGRIM is done in a series of Haskell programs, each
one more detailed and lower level than the previous one. A high-level evaluator of
the instruction set is the first step in this process. Following steps include: choos-
ing sizes for memory structures, adding lower level optimizations, and splitting
up functions to follow the structure of the hardware architecture.
We intend to use CλaSH [2] in the final step to produce a synthesizable imple-
mentation of the PilGRIM. CλaSH is a subset of Haskell that can be compiled
to VHDL (a hardware description language supported by many existing tools).
We have gone through several iterations of the first few steps of this design
process, starting from roughly GRIN as the instruction set on a simple single
stack architecture, up to the current design. Making the step from a single cycle
model to a fully pipelined architecture is ongoing work.
5 Evaluation and Comparison
For measurements in this section, we make use of the benchmark set of the
Reduceron [9]. We selected these benchmarks because they require only minimal
Table 2. Measurements of the performance of the instruction set

program     | PilGRIM instrs. | instr. rate to match | arithmetic  | node store  | Reduceron cycles
            | [million]       | PC [MHz]             | instrs. [%] | instrs. [%] | [million]
PermSort    | 166             | 231                  |  4          | 34          | 154
Queens      |  84             | 645                  | 42          | 10          |  96
Braun       |  83             | 347                  |  2          | 35          |  66
OrdList     |  98             | 271                  |  0          | 32          |  94
Queens2     | 111             | 278                  |  0          | 32          | 120
MSS         |  86             | 728                  | 24          |  1          |  66
Adjoxo      |  23             | 378                  | 23          | 13          |  37
Taut        |  39             | 352                  |  5          | 18          |  54
CountDown   |  16             | 324                  | 25          | 17          |  18
Clausify    |  46             | 284                  |  5          | 17          |  67
While       |  46             | 359                  |  9          | 24          |  56
SumPuz      | 277             | 338                  | 12          | 18          | 362
Cichelli    |  29             | 587                  | 19          |  6          |  39
Mate        | 343             | 528                  | 15          |  6          | 628
KnuthBendix |  15             | 302                  |  8          | 18          |  16
support for primitive operations, and their memory usage and run times are small
enough to use them in simulation. The results come from an instruction-level
simulator we developed for the PilGRIM. All measurements exclude the garbage
collection overhead. GHC 6.10.4 (with the option -O2) was used for running
the benchmarks on a PC and to produce the simple core language for the
PilGRIM code generator.
For 11 out of the 15 benchmarks in Table 2, 400 million PilGRIM instructions
per second is enough to match the performance of GHC on a 2.66GHz Intel i5
PC. The benchmarks with a high percentage of arithmetic instructions do not
perform so well on the PilGRIM, while the best performing ones are store inten-
sive. The arithmetic performance of the PilGRIM could be improved, but the
architecture has no inherent advantage on this aspect over other architectures.
For a deeply pipelined processor, the performance in instructions alone says
little about the actual performance. Pipeline stalls due to control flow and the
latency of memory operations are a big factor.
It is too early to compare with the Reduceron directly on performance, but
we can compare the instruction set approach of the PilGRIM versus template
instantiation in the Reduceron. For most benchmark programs the difference is
fairly small (note that one is measured in instructions and the other in cycles).
The Reduceron does better on Braun and MSS, due to requiring one step less in
a critical small inner ’loop’ in both programs. Where the numbers for PilGRIM
are significantly lower, such as in Adjoxo and Mate, we believe the difference can
be entirely attributed to the optimizations applied by GHC.
5.1 Related Work
Between roughly 1975 and 1990, a lot of work was done on the design of Lisp
machines and combinator-based processors [14,13]. Big differences in implemen-
tation strategy and hardware technology leave little to directly compare this
work to. The only processor designs comparable to our work are Augustsson’s
Big Word Machine (BWM) [1] from 1991 and the recent work by Naylor and
Runciman on the Reduceron [8,9]. Our work was inspired by the Reduceron,
because of its promising results. The unusual architecture of the Reduceron
leaves room (from a hardware perspective) for improvements and many alter-
native design choices. The BWM, Reduceron and PilGRIM have in common
that they all focus on exploiting the potential parallelism in data movements
inherent to functional languages, by reading multiple values in parallel from the
stack and rearranging them through a large crossbar in every cycle. Advances
in hardware technology allowed the Reduceron and the PilGRIM to go a step
further than the BWM, by also using a wide heap memory and adding special-
purpose stacks, that can be used in parallel. Both the BWM and the Reduceron
choose to encode data constructors and case expressions in functions for hard-
ware simplicity. The Reduceron adds special hardware to speed up handling
these function-application-encoded case expressions. The Reduceron is based on
template instantiation, while the BWM uses a small instruction set, based on the
G-machine [6]. Unfortunately, the BWM was only simulated and never built. The
Reduceron has been implemented on an FPGA, achieving a clock speed close to
100MHz. The Reduceron executes one complete reduction step per cycle.
While the Reduceron achieves surprisingly high performance given its simplic-
ity, the single cycle nature of its design is the limiting factor in performance. It
will have a relatively low clock speed even in a silicon implementation, because
of the latency of memory reads and writes performed within every cycle. With
its extensive use of pipelining, the PilGRIM targets a high clock frequency,
which comes with a strong increase in the complexity of the design.
6 Conclusions
It is feasible to design an efficient processor for lazy functional languages using a
high-level instruction set. We designed the instruction set to match closely with
the source language, so that code generation is (relatively) simple. Lazy func-
tional languages can expose a lot of low-level parallelism, even in an instruction
set based architecture. Most of this parallelism is in data movements, and can be
exploited by using wide memories and many parallel stacks. A large part of the
abstraction overhead typically incurred by high-level languages can be removed
by adding specialized hardware.
Measurements have shown that the designed instruction set is better suited
for the functional language domain than the instruction set of typical general
purpose processors. Combined with the results from the Reduceron, we con-
clude that the potential for exploiting low-level parallelism is inherent to lazy
functional languages. This form of parallelism is not restricted to a single im-
plementation strategy, but works both for template instantiation and a grin-
derived instruction set. The advantage of using an instruction set based and
pipelined design is that many existing hardware optimization techniques can be
applied to the PilGRIM.
As for absolute performance numbers compared to desktop processors: the
big open question is how often the pipeline of this core will stall on cache misses
and branches. To become competitive with current general purpose processors,
the PilGRIM needs to execute about 500 million instructions per second. We
might be able to achieve that with the not unreasonable numbers of a 1 GHz
operating frequency and executing an instruction every other cycle on average.
The optimal trade off between frequency and pipeline stalls is an open question,
and we might need to settle for a lower frequency in a first version, due to the
complexity of a deep pipelined design.
Future Work. The next step is finishing a fully pipelined, cycle accurate sim-
ulation model with all optimizations applied. Most optimizations under consid-
eration are to avoid pipeline stalls and to reduce the latency and bandwidth of
memory accesses. Another area for optimization is the exploitation of potential
parallelism in arithmetic operations. Then, we plan to transform the simula-
tion model into a synthesizable hardware description of this architecture, and
make it work on an FPGA, so that real programs can be benchmarked with it.
Once this processor is complete, a lot of interesting research possibilities open
up, like building a multicore system with it and making the core run threads
concurrently by fine-grained multithreading.
Acknowledgements. We are grateful to the Reduceron team for making their
benchmark set available online and providing raw numbers of the Reduceron.
We also thank the anonymous reviewers, Kenneth Rovers, Christiaan Baaij and
Raphael Poss for their extensive and helpful comments on earlier versions of this
paper.
References
1. Augustsson, L.: BWM: A concrete machine for graph reduction. Functional Pro-
gramming, 36–50 (1991)
2. Baaij, C.P.R., Kooijman, M., Kuper, J., Boeijink, W.A., Gerards, M.E.T.: CλaSH:
Structural descriptions of synchronous hardware using Haskell. In: Proceedings of
the 13th EUROMICRO Conference on Digital System Design: Architectures, Meth-
ods and Tools, pp. 714–721 (September 2010)
3. Boquist, U.: Code Optimisation Techniques for Lazy Functional Languages. Ph.D.
thesis, Chalmers University of Technology (April 1999),
http://www.cs.chalmers.se/~boquist/phd/phd.ps.gz
4. Boquist, U., Johnsson, T.: The GRIN project: A highly optimising back end for
lazy functional languages. Implementation of Functional Languages, 58–84 (1996)
5. Hennessy, J.L., Patterson, D.A.: Computer Architecture: A Quantitative Ap-
proach, 4th edn. Morgan Kaufmann, San Francisco (2006)
6. Johnsson, T.: Efficient compilation of lazy evaluation. In: SIGPLAN Symposium
on Compiler Construction, pp. 58–69 (1984)
7. Marlow, S., Peyton Jones, S.L.: Making a fast curry: push/enter vs. eval/apply for
higher-order languages. In: ICFP, pp. 4–15 (2004)
8. Naylor, M., Runciman, C.: The Reduceron: Widening the von Neumann bottleneck
for graph reduction using an FPGA. In: Chitil, O., Horváth, Z., Zsók, V. (eds.)
IFL 2007. LNCS, vol. 5083, pp. 129–146. Springer, Heidelberg (2008)
9. Naylor, M., Runciman, C.: The Reduceron reconfigured. In: ICFP (2010)
10. Nethercote, N., Mycroft, A.: The cache behaviour of large lazy functional programs
on stock hardware. In: MSP/ISMM, pp. 44–55 (2002)
11. Peyton Jones, S.L.: The Implementation of Functional Programming Languages.
Prentice-Hall, Inc., Englewood Cliffs (1987)
12. Peyton Jones, S.L.: Implementing lazy functional languages on stock hardware:
The spineless tagless g-machine. J. Funct. Program. 2(2), 127–202 (1992)
13. Scheevel, M.: NORMA: A graph reduction processor. In: LISP and Functional Pro-
gramming, pp. 212–219 (1986)
14. Stoye, W.R., Clarke, T.J.W., Norman, A.C.: Some practical methods for rapid
combinator reduction. In: LISP and Functional Programming, pp. 159–166 (1984)
15. Tolmach, A.: An external representation for the GHC core language (2001)
A Semantics
We chose to present the semantics on the level of assembly language, instead of
the instruction set, in order to describe how the PilGRIM works without getting
lost in less important details.
The processor state H || S || Q is a triple consisting of the heap H with bindings
from references to nodes, the stack S (a sequence of stack frames), and a queue
Q (a local sequence of temporary values). Each frame N | C | R on stack S has
three parts: the local node stack N (a nonempty sequence of nodes), the optional
local update/apply/select continuation stack C, and a return target R. Possible
return targets are: returning a node or reference (Rnto/Rrto) to an instruction
sequence, returning into a case statement (Rcase), continuing with the next stack
frame (Rnext), or completing the main function (Rmain).
The semantics are written in an operational style, structured by the various
modes of operation. These modes, I, C, R, E, and W, are explained below. Fig-
ure 5 contains the semantics for executing instructions, indicated by the mode I,
with the instruction block as first argument. The callee mode C is only a nota-
tional construct to avoid the combinatorial explosion between what is called and
how it is called. Underlining of variables denotes the conversion from stack/queue
indices to values (by reading from the stack and/or queue).
A program starts in evaluation mode E, with the following initial state:
⟨ḡ ↦ F_g⟩ || F_entry ā | ϵ | Rmain || ϵ. The heap is initialized with a vector of bind-
ings from constant references to nullary function nodes corresponding to all
CAFs in the program. Initially the stack contains a single frame with the entry
function and its arguments, F_entry ā, on the node stack. The queue and the extra
continuation stack start empty, denoted by ϵ.
I(Store tx);ı H||S||QIı H, y → tx||S||y, Q,y=newRef(H)
I(PushCAF g);ı H||S||QIı H||S||g, Q
I(PrimOp x);ı H||S||QIı H||S||y, Q,y=x
I(Constant n);ı H||S||QIı H||S||n, Q
I(Call ce);ı H||S||QCce(Rnto ı )H||S||Q
I(Force ce);ı H||S||QCce(Rrto ı )H||S||Q
I(Jump ce)H||S||QCceRnext H||S||Q
I(Case cea)H||S||QCce(Rcase a)H||S||Q
I(Return tx)H||S||QR(tx)H||S||Q
I(If (xy)te)H||S||QItH||S||Q,ifx y
IeH||S||Q,otherwise
I(Throw x)H||S||QWxH||S||
C(Eval x)erH||S||QEH||n|e|r,S||x, n = load(H,x)
C(EvalCAF g)erH||S||QEH||n|e|r,S||g, n = load(H,g)
C(TLF fx)erH||S||QIı H||Ffx |e|r,S||Q,ı =code(f)
C(Fix fx)erH||S||QIı H||Ffx |Update y, e|r,S||y, Q,y=newRef(H)
Fig. 5. Semantics of the assembly language instructions
RnH||N|C|R,S||QEH||n|C|R,S||
EH||Ffx |C|R,S||uIı H||Ffx |Update u, C |R,S||,ı =code(f)
EH||Ffx |C|R,S||Iı H||Ffx |C|R,S||,ı =code(f)
EH||n|Update u, C|R,S||EH, u → n||n|C|R,S||
EH||n|Catch h, C|R,S||EH||n|C|R,S||
EH||Ccx |Select i, C|R,S||EH||n|C|R,S||xi, n = load(H, xi)
EH||Pm
fx |Apply a, C|R,S||
EH||Pm−|a|
f[x, a]|C|R,S||,ifm>|a|
EH||Ff[x,a]|C|R,S||,ifm=|a|
EH||Ff[x, a0,.,a
m1]|Apply [am,.,a
|a|−1],C|R,S||,ifm<|a|
EH||n||Rnext,N|C|R,S||EH||n|C|R,S||
EH||n||Rnto ı ,N|C|R,S||Iı H||n, N |C|R,S||
EH||n||Rrto ı ,S||Iı H, y → n||S||y,y=newRef(H)
EH||Ccx ||Rcase a,N|C|R,S||Iı H||Ccx, N|C|R,S||,ı =selectAlt(a,c)
EH||n||Rmain||H||n||||(end of program)
WxH||N|Catch h, C|R,S||EH||n|Apply x, C|R,S||h, n = load(H,h)
WxH||N|c, C|R,S||WxH||N|C|R,S||
WxH||N||Rmain||H||||(exit with exception)
WxH||N||R,S||WxH||S||
Fig. 6. Semantics of the stack evaluation/unwinding
Figure 6 contains the stack evaluation mode E and the exception unwinding mode
W. Here, the order of the rules does matter: if multiple rules could match,
the topmost one is taken. To avoid too many combinations, the rules for the
evaluation mode are split up into multiple small steps. Returning a node in mode
R is (after pushing the node onto the stack) identical to evaluation (of a node
loaded from the heap). Thus it can share the following steps.
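
To make one of the denser rules concrete, the three Apply cases for a p-node
(mode E in Figure 6) can be rendered as a small executable Haskell sketch; the
function and its result encoding are ours.

    -- The three Apply cases of Figure 6 (our sketch). A p-node with m missing
    -- arguments and existing arguments xs is applied to as: the result is
    -- either a new partial application (Left) or a saturated call together
    -- with the leftover arguments for a fresh Apply continuation (Right).
    applyPap :: Int -> [a] -> [a] -> Either (Int, [a]) ([a], [a])
    applyPap m xs as
      | m > length as = Left (m - length as, xs ++ as)      -- still partial
      | otherwise     = Right (xs ++ take m as, drop m as)  -- saturated

    -- For m = length as the leftover list is empty, matching the middle rule.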