Stream Fusion
From Lists to Streams to Nothing at All
Duncan Coutts (1)    Roman Leshchinskiy (2)    Don Stewart (2)
(1) Programming Tools Group, Oxford University Computing Laboratory
    duncan.coutts@comlab.ox.ac.uk
(2) Computer Science & Engineering, University of New South Wales
    {rl,dons}@cse.unsw.edu.au
Abstract
This paper presents an automatic deforestation system, stream fu-
sion, based on equational transformations, that fuses a wider range
of functions than existing short-cut fusion systems. In particular,
stream fusion is able to fuse zips, left folds and functions over
nested lists, including list comprehensions. A distinguishing fea-
ture of the framework is its simplicity: by transforming list func-
tions to expose their structure, intermediate values are eliminated
by general purpose compiler optimisations.
We have reimplemented the Haskell standard List library on top
of our framework, providing stream fusion for Haskell lists. By al-
lowing a wider range of functions to fuse, we see an increase in the
number of occurrences of fusion in typical Haskell programs. We
present benchmarks documenting time and space improvements.
Categories and Subject Descriptors D.1.1 [Programming Techniques]: Applicative (Functional) Programming; D.3.4 [Programming Languages]: Optimization
General Terms Languages, Algorithms
Keywords Deforestation, program optimisation, program trans-
formation, program fusion, functional programming
1. Introduction
Lists are the primary data structure of functional programming. In
lazy languages, such as Haskell, lists also serve in place of tra-
ditional control structures. It has long been recognised that com-
posing list functions to build programs in this style has advantages
for clarity and modularity, but that it incurs some runtime penalty,
as functions allocate intermediate values to communicate results.
Fusion (or deforestation) attempts to remove the overhead of pro-
gramming in this style by combining adjacent transformations on
structures to eliminate intermediate values.
Consider this simple function which uses a number of interme-
diate lists:
f :: Int -> Int
f n = sum [ k * m | k <- [1..n], m <- [1..k] ]
No previously implemented short-cut fusion system eliminates all
the lists in this example. The fusion system presented in this paper
does. With this system, the Glasgow Haskell Compiler (The GHC
Team 2007) applies all the fusion transformations and is able to
generate an efficient “worker” function f' that uses only unboxed
integers (Int#) and runs in constant space:
f' :: Int# -> Int#
f' n =
  let go :: Int# -> Int# -> Int#
      go z k =
        case k > n of
          False -> case 1 > k of
                     False -> to (z + k) k (k + 1) 2
                     True  -> go z (k + 1)
          True  -> z
      to :: Int# -> Int# -> Int# -> Int# -> Int#
      to z k k' m =
        case m > k of
          False -> to (z + (k * m)) k k' (m + 1)
          True  -> go z k'
  in go 0 1
Stream fusion takes a simple three step approach:
1. Convert recursive structures into non-recursive co-structures;
2. Eliminate superfluous conversions between structures and co-
structures;
3. Finally, use general optimisations to fuse the co-structure code.
By transforming pipelines of recursive list functions into non-
recursive ones, code becomes easier to optimise, producing better
results. The ability to fuse all common list functions allows the pro-
grammer to write in an elegant declarative style, and still produce
excellent low level code. We can finally write the code we want to
be able to write without sacrificing performance!
1.1 Short-cut fusion
The example program is a typical high-level composition of list
producers, transformers and consumers. However, extensive opti-
misations are required to transform programs written in this style
into efficient low-level code. In particular, naive compilation will
produce a number of intermediate data structures, resulting in poor
performance. We would like to have the compiler remove these
intermediate structures automatically. This problem, deforesta-
tion (Wadler 1990), has been studied extensively (Meijer et al.
1991; Gill et al. 1993; Takano and Meijer 1995; Gill 1996; Hu
et al. 1996; Chitil 1999; Johann 2001; Svenningsson 2002; Gibbons
2004). To illustrate how our approach builds on previous work on
short-cut fusion, we review the main approaches.
build/foldr The most practically successful list fusion system to
date is the build/foldr system (Gill et al. 1993). It uses two combina-
tors, foldr and build, and a single fusion rule to eliminate adjacent
occurrences of the combinators. Fusible functions must be written
in terms of these two combinators. A range of standard list func-
tions, and list comprehensions, can be expressed and effectively
fused in this way.
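For reference, a minimal sketch of the two combinators and the rule (standard definitions; GHC's actual versions live in its base library, and mapB is an illustrative name, not the library's):

{-# LANGUAGE RankNTypes #-}

-- build abstracts a list over its cons and nil
build :: (forall b. (a -> b -> b) -> b -> b) -> [a]
build g = g (:) []

-- the single fusion rule:  foldr k z (build g) = g k z

-- a fusible map, written in terms of the two combinators
mapB :: (a -> b) -> [a] -> [b]
mapB f xs = build (\c n -> foldr (c . f) n xs)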
There are some notable exceptions that cannot be effectively fused under build/foldr: left folds (foldl, i.e., functions such as sum that consume a list using an accumulating parameter) and zips (functions that consume multiple lists in parallel).
destroy/unfoldr A more recent proposal (Svenningsson 2002)
based on unfolds rather than folds addresses these specific short-
comings. However, as proposed, it does not cover functions that
handle nested lists (such as concatMap) or list comprehensions,
and there are inefficiencies fusing filter-like functions, which must
be defined recursively.
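For comparison, a minimal sketch of the destroy combinator (after Svenningsson 2002); these are the standard definitions, with unfoldr as in Data.List:

{-# LANGUAGE RankNTypes #-}

destroy :: (forall s. (s -> Maybe (a, s)) -> s -> b) -> [a] -> b
destroy g = g step
  where
    step []       = Nothing
    step (x : xs) = Just (x, xs)

-- the fusion rule:  destroy g (unfoldr psi e) = g psi e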
stream fusion Recently, we proposed a new fusion framework for
operations on strict arrays (Coutts et al. 2007). While the perfor-
mance improvements demonstrated for arrays are significant, this
previous work describes fusion for only relatively simple opera-
tions: maps, filters and folds. It does not address concatenations,
functions on nested lists, or zips.
In this paper we extend stream fusion to fill in the missing pieces.
Our main contributions are:
- an implementation of stream fusion for lists (Section 2);
- extension of stream fusion to zips, concats, appends (Section 3) and functions on nested lists (Section 4);
- a translation scheme for stream fusion of list comprehensions (Section 5);
- an account of the compiler optimisations required to remove intermediate structures produced by fusion, including functions on nested lists (Section 7);
- an implementation of stream fusion using compiler rewrite rules and concrete results from a complete implementation of the Haskell list library (Section 8).
2. Streams
The intuition behind build/foldr fusion is to view lists as sequences
represented by data structures, and to fuse functions that work
directly on the natural structure of that data. The destroy/unfoldr
and stream fusion systems take the opposite approach. They convert
operations over the list data structure to instead work over the dual
of the list: its unfolding or co-structure.
In contrast to destroy/unfoldr, stream fusion uses an explicit rep-
resentation of the sequence co-structure: the Stream type. Separate
functions, stream and unstream, are used to convert between lists
and streams.
2.1 Converting lists to streams
The first step in order to fuse list functions with stream fusion is
to convert a function on list structures to a function on stream co-
structures (and back again) using stream and unstream combina-
tors. The function map, for example, is simply specified as:
map :: (a -> b) -> [a] -> [b]
map f = unstream . map_s f . stream

which composes a map function over streams with stream conversions to and from lists.
While the natural operation over a list data structure is a fold,
the natural operation over a stream co-structure is an unfold. The
Stream type encapsulates an unfold, wrapping an initial state and
a stepper function for producing elements. It is defined as:
data Stream a = forall s. Stream (s -> Step a s) s

data Step a s = Done
              | Yield a s
              | Skip    s
Note that the type of the stream state is existentially quantified
and does not appear in the result type: the stream state is encapsu-
lated. The Stream data constructor itself is similar to the standard
Haskell list function unfoldr (Gibbons and Jones 1998):

Stream  :: forall s a. (s -> Step a s)     -> s -> Stream a
unfoldr :: forall s a. (s -> Maybe (a, s)) -> s -> [a]
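As a concrete illustration of the co-structure, here is the infinite stream of natural numbers, whose encapsulated state is simply the next number to yield (nats is an illustrative name):

nats :: Stream Int
nats = Stream next 0
  where
    next n = Yield n (n + 1)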
Writing functions over streams themselves is relatively straightforward. map_s, for example, simply applies its function argument to each yielded stream element when the stepper is called:
map_s :: (a -> b) -> Stream a -> Stream b
map_s f (Stream next0 s0) = Stream next s0
  where
    next s = case next0 s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s' -> Yield (f x) s'
The stream function can be defined directly as a stream whose elements are those of the corresponding list, using the list itself as the stream state. It is of course non-recursive, yielding each element of the list as it is unfolded:
stream :: [a] -> Stream a
stream xs0 = Stream next xs0
  where
    next []     = Done
    next (x:xs) = Yield x xs
The unstream function unfolds the stream and builds a list
structure. Unfolding a stream to produce a list is achieved by
repeatedly calling the stepper function of the stream, to yield the
stream’s elements.
unstream :: Stream a -> [a]
unstream (Stream next0 s0) = unfold s0
  where
    unfold s = case next0 s of
      Done       -> []
      Skip s'    -> unfold s'
      Yield x s' -> x : unfold s'
In contrast to unfoldr, the Stream stepper function has one other alternative: it can Skip, producing a new state but yielding no new value in the sequence. This is not necessary for the semantics but, as we shall show later, it is crucial for the implementation. In particular, it is what allows all stepper functions to be non-recursive.
2.2 Eliminating conversions
Writing list functions using compositions of stream and unstream
is clearly inefficient: each function must first construct a new
Stream , and when it is done, unfold the stream back to a list.
This is evident in the definition of map from the previous section. Instead of consuming and constructing a list once, stream consumes a list, allocating Step constructors; map_s consumes and allocates more Step constructors; finally, unstream consumes the Step constructors and allocates new list nodes. However, if we compose two functions implemented via streams:

map f . map g
  = unstream . map_s f . stream . unstream . map_s g . stream
we immediately see an opportunity to eliminate the intermediate
list conversions!
Assuming stream . unstream is the identity on streams, we obtain the rewrite rule:

⟨stream/unstream fusion⟩   ∀s. stream (unstream s) ↦ s
The Glasgow Haskell Compiler supports programmer-defined
rewrite rules (Peyton Jones et al. 2001), applied by the compiler
during compilation. We can specify the stream fusion rule as part
of the list library source code — without changing the compiler.
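Concretely, such a rule can be written with GHC's RULES pragma, roughly as follows (the rule name is illustrative; the pragma in the actual library may differ in detail):

{-# RULES
"stream/unstream"  forall s. stream (unstream s) = s
  #-}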
When the compiler applies this rule to our example, it yields:
unstream . map_s f . map_s g . stream
Our pipeline of list transformers has now been transformed into
a pipeline of stream transformers. Externally, the pipeline still con-
sumes and produces lists, just as the direct list implementation of map . map does. However, internally, the map_s f . map_s g pipeline is a composition of simple, non-recursive stream functions.
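To see what this buys us, consider what is left of the pipeline once the steppers are inlined into one another. A sketch of the result for the map/map pipeline, under the simplifications described in Section 7 (mapmap is a hypothetical name used only for illustration):

mapmap :: (b -> c) -> (a -> b) -> Stream a -> Stream c
mapmap f g (Stream next0 s0) = Stream next s0
  where
    -- the single stepper remaining after inlining and case-of-case;
    -- no Step constructor from the inner map survives
    next s = case next0 s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s' -> Yield (f (g x)) s'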
It is interesting to note that the stream/unstream rule is not
really a classical fusion rule at all. It only eliminates the list allo-
cations that were introduced in converting operations to work over
streams.
2.3 Fusing co-structures
Having converted the functions over list structures into functions
over stream co-structures, the question now is how to optimise
away intermediate Step constructors produced by composed func-
tions on streams.
The key trick is that all stream producers are non-recursive.
Once list functions have been transformed to compositions of
non-recursive stepper functions, there is an opportunity for real
fusion: the compiler can relatively easily eliminate intermediate
Step constructors produced by the non-recursive steppers, using
existing general purpose optimisations. We describe this process in
detail in Section 7.
3. Writing stream combinators
Figure 1 shows the definitions of several standard algorithms on
flat streams which we use throughout the paper. For the most part,
these definitions are essentially the same as those presented in our
previous work (Coutts et al. 2007). In the following, we discuss
some of the combinators and highlight the principles underlying
their implementation.
No recursion: filter   Similarly to map_s, the stepper function for filter_s is non-recursive, which is crucial for producing efficient fused code. In the case of filter_s, however, a non-recursive implementation is only possible by introducing Skip in place of elements that are removed from the stream; the only alternative is to recursively consume elements from the input stream until we find one that satisfies the predicate (as the filter function in the destroy/unfoldr system must). As we are able to avoid this recursion, we maintain trivial control flow for streams, and the optimiser never has to see through fixed points, yielding better code.
Consuming streams: fold   The only place where recursion is allowed is when we consume a stream to produce a value of a different type. The canonical examples of this are foldr_s and foldl_s, which are defined in Figure 1. To understand this, it helps to see compositions of stream functions simply as descriptions of pipelines which, on their own, do nothing: they require a recursive function at the end of the pipeline to unroll sequence elements and actually construct a concrete value.
Recursion is thus only involved in repeatedly pulling values out of the stream; each step in the pipeline itself requires no recursion. Of course, because a single step might skip, it may take many steps to actually yield a value.
Complex stream states: append   Many operations on streams encode complex control flow by using non-trivial state types.
filter_s :: (a -> Bool) -> Stream a -> Stream a
filter_s p (Stream next0 s0) = Stream next s0
  where
    next s = case next0 s of
      Done                   -> Done
      Skip s'                -> Skip s'
      Yield x s' | p x       -> Yield x s'
                 | otherwise -> Skip s'

return_s :: a -> Stream a
return_s x = Stream next True
  where
    next True  = Yield x False
    next False = Done

enumFromTo_s :: Enum a => a -> a -> Stream a
enumFromTo_s l h = Stream next l
  where
    next n | n > h     = Done
           | otherwise = Yield n (succ n)

foldr_s :: (a -> b -> b) -> b -> Stream a -> b
foldr_s f z (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> z
      Skip s'    -> go s'
      Yield x s' -> f x (go s')

foldl_s :: (b -> a -> b) -> b -> Stream a -> b
foldl_s f z (Stream next s0) = go z s0
  where
    go z s = case next s of
      Done       -> z
      Skip s'    -> go z s'
      Yield x s' -> go (f z x) s'

append_s :: Stream a -> Stream a -> Stream a
append_s (Stream next_a sa0) (Stream next_b sb0) =
    Stream next (Left sa0)
  where
    next (Left sa) = case next_a sa of
      Done        -> Skip (Right sb0)
      Skip sa'    -> Skip (Left sa')
      Yield x sa' -> Yield x (Left sa')
    next (Right sb) = case next_b sb of
      Done        -> Done
      Skip sb'    -> Skip (Right sb')
      Yield x sb' -> Yield x (Right sb')

zip_s :: Stream a -> Stream b -> Stream (a, b)
zip_s (Stream next_a sa0) (Stream next_b sb0) =
    Stream next (sa0, sb0, Nothing)
  where
    next (sa, sb, Nothing) = case next_a sa of
      Done        -> Done
      Skip sa'    -> Skip (sa', sb, Nothing)
      Yield a sa' -> Skip (sa', sb, Just a)
    next (sa', sb, Just a) = case next_b sb of
      Done        -> Done
      Skip sb'    -> Skip (sa', sb', Just a)
      Yield b sb' -> Yield (a, b) (sa', sb', Nothing)

Figure 1: Flat stream combinators
concatMap_s :: (a -> Stream b) -> Stream a -> Stream b
concatMap_s f (Stream next_a sa0) = Stream next (sa0, Nothing)
  where
    next (sa, Nothing) = case next_a sa of
      Done        -> Done
      Skip sa'    -> Skip (sa', Nothing)
      Yield a sa' -> Skip (sa', Just (f a))
    next (sa, Just (Stream next_b sb)) = case next_b sb of
      Done        -> Skip (sa, Nothing)
      Skip sb'    -> Skip (sa, Just (Stream next_b sb'))
      Yield b sb' -> Yield b (sa, Just (Stream next_b sb'))

Figure 2: Definition of concatMap_s on streams
One example is append_s, which produces a single stream by concatenating two independent streams with possibly different state types. The state of the new stream necessarily contains the states of the two component streams.
To implement concatenation we notice that at any moment we need only the state of the first stream or the state of the second. The next function for append thus operates in two modes, either yielding elements from the first stream or yielding elements from the second. The two modes can be encoded as a sum type, Either sa sb, tagging which mode the stepper is in; the modes are thus represented as Left sa or Right sb, and there is one clause of next for each. When we get to the end of the first stream we have to switch modes so that we can start yielding elements from the second stream.
This is another example where it is convenient to use Skip. Instead of immediately having to yield the first element of the second stream (which is impossible anyway, since the second stream may itself skip), we can simply transition into a new state from which we will be able to do so. The rule of thumb is, in each step, to do one thing and one thing only.
What is happening here of course is that we are using state
to encode control flow. This is the pattern used for all the more
complex stream functions. Section 7.1 explains how code in this
style is optimised.
Consuming multiple streams: zip   Functions that consume multiple streams in parallel, such as zip_s, also require non-trivial state. Unsurprisingly, the definition of zip_s on streams is quite similar to the equivalent definition in the destroy/unfoldr system. The main difference is that the stream version has to cope with streams that produce Skips, which complicates matters slightly. In particular, it means that we must cope with a situation where we have an element from the first stream but cannot immediately (i.e., non-recursively) obtain an element from the second one.
So rather than trying to extract an element from one stream and then from another in a single step, we must pull from the first stream, store the element in the state, and then move into a new state where we attempt to pull a value from the second stream. Once the second stream has yielded a value, we can return the pair. In each call of the next function we pull from at most one stream; again we see that in any single step we can do only one thing.
4. Functions on nested streams
The last major class of list functions that we need to be able to
fuse are ones that deal with nested lists. The canonical example is
concatMap, but this class also includes all the list comprehensions.
In terms of control structures, these functions represent nested
recursion and nested loops.
T⟦[E | ]⟧             = return E
T⟦[E | B, Q]⟧         = guard B (T⟦[E | Q]⟧)
T⟦[E | P <- L, Q]⟧    = let f P = True
                            f _ = False
                            g P = T⟦[E | Q]⟧
                            h x = guard (f x) (g x)
                        in concatMap h L
T⟦[E | let decls, Q]⟧ = let decls in T⟦[E | Q]⟧

Figure 3: Translation scheme for list comprehensions
The ordinary list concatMap function has the type:
concatMap :: (a -> [b]) -> [a] -> [b]
For each element of its input list it applies a function which gives another list, and it concatenates all these lists together. To define a list concatMap that is fusible with its input and output lists, and with the function that yields a list, we will need a stream-based concatMap_s with the type:

concatMap_s :: (a -> Stream b) -> Stream a -> Stream b
To get back the list version we compose concatMap_s with stream and unstream, and compose the function argument f with stream:

concatMap f = unstream . concatMap_s (stream . f) . stream
To convert a use of list concatMap to stream form we need a fusible list consumer c and fusible list producers p and f. For c, p and f to be fusible means that they must be defined in terms of stream or unstream and appropriate stream consumers and producers c_s, p_s and f_s:

c = c_s . stream
p = unstream . p_s
f = unstream . f_s
We now compose them, expanding their definitions to reveal the stream and unstream conversions, and then apply the stream/unstream fusion rule three times:

c . concatMap f . p
  = c_s . stream
        . unstream . concatMap_s (stream . f) . stream
        . unstream . p_s
  = c_s . concatMap_s (stream . f) . p_s
  = c_s . concatMap_s (stream . unstream . f_s) . p_s
  = c_s . concatMap_s f_s . p_s
Actually defining concatMap_s on streams is somewhat tricky. We need to get an element a from the outer stream; then f a gives us a new inner stream. We must drain this inner stream before moving on to the next outer element a.
There are thus two modes: one where we are trying to obtain an element from the outer stream, and another in which we have the current inner stream and are pulling elements from it. We can represent these two modes with the state type:

(sa, Maybe (Stream b))

where sa is the state type of the outer stream. The full concatMap_s definition is given in Figure 2.
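As a small usage example, duplicating every element of a list goes through concatMap_s and fuses with adjacent stream operations (dupAll is an illustrative name, not part of the library):

dupAll :: [a] -> [a]
dupAll = unstream . concatMap_s (\x -> stream [x, x]) . stream

-- dupAll [1,2,3] == [1,1,2,2,3,3]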
5. List comprehensions
List comprehensions provide a very concise way of expressing
operations that select and combine lists. It is important to fuse them
to help achieve our goal of efficiently compiling elegant, declarative
programs. Recall our introductory example:
f n = sum [ k * m | k <- [1..n], m <- [1..k] ]
There are two aspects to fusion of list comprehensions. One is
fusing with list generators. Obviously this is only possible when
the generator expression is itself fusible. The other aspect is elimi-
nating any intermediate lists used internally in the comprehension,
and allowing the comprehension to be fused with a fusible list con-
sumer.
The build/foldr system tackles this second aspect directly by
using a translation of comprehensions into uses of build and
foldr that, by construction, uses no intermediate lists. Furthermore,
by using foldr to consume the list generators it allows fusion there
too.
Obviously the build/foldr translation, employing build, is not
suitable for streams. The other commonly used translation (Wadler
1987) directly generates recursive list functions. For streams we
either need a translation directly into a single stream (potentially
with a very complex state and stepper function) or a translation into
fusible primitives. We opt for the second approach which makes the
translation itself simpler but leaves us with the issue of ensuring
that the expression we build really does fuse.
We use a translation very similar to the translation given in the
Haskell language specification (Peyton Jones et al. 2003). However,
there are a couple of important differences. The first change is
to always translate into list combinators, rather than concrete list
syntax. This allows us to delay expansion of these functions and
use compiler rewrite rules to turn them into their stream-fusible
counterparts.
The second change is to modify the translation so that condi-
tionals do not get in the way of fusion. The Haskell’98 translations
for expressions and generators are:
T⟦[E | B, Q]⟧      = if B then T⟦[E | Q]⟧ else []
T⟦[E | P <- L, Q]⟧ = let ok P = T⟦[E | Q]⟧
                         ok _ = []
                     in concatMap ok L
Note that in the generator case P can be any pattern, so pattern-match failure is a possibility. This is why the ok function has a catch-all clause.
We cannot use this translation directly because in both cases the resulting list is selected on the basis of a test. We cannot directly fuse when the stream producer is not statically known, as is the case when we must make a dynamic choice between two streams. The solution is to push the dynamic choice inside the stream, using the function guard:
guard :: Bool -> [a] -> [a]
guard True  xs = xs
guard False xs = []
This function is quite trivial, but by using a named combinator rather than primitive syntax we are able to rewrite it to a stream-fusible implementation:
guard_s :: Bool -> Stream a -> Stream a
guard_s b (Stream next0 s0) = Stream next (b, s0)
  where
    next (False, _) = Done
    next (True,  s) = case next0 s of
      Done       -> Done
      Skip s'    -> Skip (True, s')
      Yield x s' -> Yield x (True, s')
The full translation is given in Figure 3. We can use guard directly for the case of filter expressions. For generators we build a function that uses guard with a predicate based on the generator's pattern, as the following example shows.
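For instance, for a comprehension whose generator pattern can fail, say [ x | Just x <- xs ] (a hypothetical example), the scheme of Figure 3 yields the following; guard is repeated to keep the sketch self-contained, and fromJusts is an illustrative name:

guard :: Bool -> [a] -> [a]
guard True  xs = xs
guard False xs = []

fromJusts :: [Maybe a] -> [a]
fromJusts xs =
  let f (Just _) = True
      f _        = False
      g (Just x) = return x   -- T⟦[x | ]⟧ = return x; g is only used when f holds
      h y        = guard (f y) (g y)
  in concatMap h xs

-- fromJusts [Just 1, Nothing, Just 3] == [1,3]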
We can now use this translation on our example. For the sake
of brevity we omit the guard functions which are trivial in this
example since both generator patterns are simple variables.
T⟦[ k * m | k <- [1..n], m <- [1..k] ]⟧
  = concatMap (\k ->
      concatMap (\m ->
        return (k * m))
        (enumFromTo 1 k))
      (enumFromTo 1 n)
Next we inline all the list functions to reveal the stream versions wrapped in stream/unstream, and we apply the fusion rule three times:

= unstream (concatMap_s (\k -> stream (
    unstream (concatMap_s (\m -> stream (
      unstream (return_s (k * m))))
      (stream (unstream (enumFromTo_s 1 k))))))
    (stream (unstream (enumFromTo_s 1 n))))

= unstream (concatMap_s (\k ->
    concatMap_s (\m ->
      return_s (k * m))
      (enumFromTo_s 1 k))
    (enumFromTo_s 1 n))
Finally, to get our full original example we apply sum_s (which is just foldl_s (+) 0) and repeat the inline-and-fuse procedure one more time. This gives us a term with no lists left; the entire structure has been converted to stream form.

= sum_s (concatMap_s (\k ->
    concatMap_s (\m ->
      return_s (k * m))
      (enumFromTo_s 1 k))
    (enumFromTo_s 1 n))
6. Correctness
Every fusion framework should come with a rigorous correctness proof. Unfortunately, many do not, and ours is no exception. This might seem surprising at first, as we introduce only one rather simple rewrite rule:

∀s. stream (unstream s) ↦ s
Should it not be easy to show that applying this rule does not change the semantics of a program or, conversely, to construct an example where the semantics is changed? In fact, a counterexample is easily found for the system presented in this paper: with s = ⊥, we have:

stream (unstream ⊥) = Stream next ⊥ ≠ ⊥
Depending on how we define equivalence on streams, other
counterexamples can be derived. In the rest of this section we
discuss possible approaches to retaining semantic soundness of
stream fusion.
6.1 Strictness of streams
The above counter-example is particularly unfortunate as it implies
that we can turn terminating programs into non-terminating ones.
In our implementation, we circumvent this problem by not export-
ing the Stream data type and ensuring that we never construct bot-
tom streams within our library. Effectively, this means that we treat
Stream as an unlifted type, even though Haskell does not provide
us with the means of saying so explicitly.[1]
Avoiding the creation of bottom streams is, in fact, fairly easy. It boils down to the requirement that all stream-constructing functions be non-strict in all arguments except those of type Stream, which we can presume not to be bottom. This is always possible, as the arguments can be evaluated in the stepper function instead. For instance, the combinator guard defined in the previous section is lazy in the condition: the latter is not inspected until the stepper function has been called for the first time.

[1] Launchbury and Paterson (1996) discuss how unlifted types can be integrated into a lazy language.
In fact, we can easily change our framework such that the rewrite rule removes bottoms instead of introducing them. For this, it is sufficient to make stream strict in its argument. Then we have stream (unstream ⊥) = ⊥. However, now we can derive a different counterexample:

stream (unstream (Stream ⊥ s)) = ⊥ ≠ Stream ⊥ s
This is much less problematic, though, as it only means that we turn some non-terminating programs into terminating ones. Unfortunately, with this definition of stream it becomes much harder to implement standard Haskell list functions such that they have the desired semantics. The Haskell 98 Report (Peyton Jones et al. 2003) requires that take 0 xs = [], i.e., take must be lazy in its second argument. In our library, take is implemented as:
take :: Int -> [a] -> [a]
take n xs = unstream (take_s n (stream xs))

take_s :: Int -> Stream a -> Stream a
take_s n (Stream next s) = Stream next' (n, s)
  where
    next' (0, s) = Done
    next' (n, s) = case next s of
      Done       -> Done
      Skip s'    -> Skip (n, s')
      Yield x s' -> Yield x (n - 1, s')
Note that since take_s is strict in the stream argument, stream must be lazy if take is to have the required semantics. An alternative would be to make take_s lazy in the stream:
take_s n s = Stream next' (n, s)
  where
    next' (0, s) = Done
    next' (n, Stream next s) =
      case next s of
        Done       -> Done
        Skip s'    -> Skip (n, Stream next s')
        Yield x s' -> Yield x (n - 1, Stream next s')
Here, we embed the entire argument stream in the seed of
the newly constructed stream, thus ensuring that it is only eval-
uated when necessary. Unfortunately, such code is currently less
amenable to being fully optimised by GHC. Indeed, efficiency was
why we preferred the less safe fusion framework presented in this
paper to the one outlined here. We do hope, however, that improve-
ments to GHC’s optimiser will allow us to experiment with alter-
natives in the future.
6.2 Equivalence of streams
Even in the absence of diverging computations, it is not entirely
trivial to define a useful equivalence relation on streams. This is
mainly due to the fact that a single list can be modeled by infinitely
many streams. Even if we restrict ourselves to streams producing
different sequences of Step values, there is still no one-to-one
correspondence — two streams representing the same list can differ
in the number and positions of Skip values they produce. This
suggests that equivalence on streams should be defined modulo
Skip values. In fact, this is a requirement we place on all stream-
processing functions: their semantics should not be affected by the
presence or absence of Skip values.
6.3 Testing
Although we do not have a formal proof of correctness of our
framework, we have tested it quite extensively. It is easy to intro-
duce subtle strictness bugs when writing list functions, either di-
rectly on lists or on streams. Fortunately we have a precise specifi-
cation in the form of the Haskell’98 report. Comparative testing on
total values is relatively straightforward; to test strictness properties, however, we need to test on partial values. We were inspired
by the approach in StrictCheck (Chitil 2006) of checking strict-
ness properties by generating all partial values up to a certain finite
depth. However, to be able to generate partial values at higher-order types, we adapted SmallCheck (Runciman 2006) to generate all
partial rather than total values up to any given depth. We used this
and the Chasing Bottoms library (Danielsson and Jansson 2004) to
compare our implementations against the Haskell’98 specification
and against the standard library used by many Haskell implemen-
tations.
This identified a number of subtle bugs in our implementation
and a handful of cases where we can argue that the specification
is unnecessarily strict. We also identified cases where the standard
library differs from the specification. The tests document the strict-
ness properties of list combinators and give us confidence that the
stream versions do, in fact, have the desired strictness.
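As an illustration of the comparative-testing idea on total values, one of the properties might be sketched as follows (prop_map is a hypothetical name; the real tests additionally exercise partial values as described above):

-- the stream-based map must agree with the Haskell'98 one,
-- here represented by the Prelude's:
prop_map :: (Char -> Char) -> [Char] -> Bool
prop_map f xs = Prelude.map f xs == unstream (map_s f (stream xs))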
7. Compiling stream code
Ultimately, a fusion framework should eliminate temporary data
structures. Stream fusion by itself does not, however, reduce allo-
cation; it merely replaces intermediate lists by intermediate Step
values. Moreover, when a stream is consumed, additional alloca-
tions are necessary to maintain its seed throughout the loop. For
instance, append allocates an Either node in each iteration.
This behaviour is quite similar to programs produced by de-
stroy/unfoldr and like the latter, our approach relies on subsequent
compiler optimisation passes to eliminate these intermediate val-
ues. Since we consider more involved list operations than Sven-
ningsson (2002), in particular nested ones, we necessarily require
more involved optimisation techniques than the ones discussed
in that work. Still, these techniques are generally useful and not
specifically tailored to programs produced by our fusion frame-
work. In this section, we identify the key optimisations necessary
to produce good code for stream-based programs and explain why
they are sufficient to replace streams by nothing at all.
7.1 Flat pipelines
Let us begin with a simple example: sum (xs ++ ys). Our fusion
framework rewrites this to:
foldl_s (+) 0 (append_s (stream xs) (stream ys))
Inlining the definitions of the stream combinators, we get:

let next_stream xs =
      case xs of
        []    -> Done
        x:xs' -> Yield x xs'

    next_append (Left xs) =
      case next_stream xs of
        Done        -> Skip (Right ys)
        Skip xs'    -> Skip (Left xs')
        Yield x xs' -> Yield x (Left xs')
    next_append (Right ys) =
      case next_stream ys of
        Done        -> Done
        Skip ys'    -> Skip (Right ys')
        Yield y ys' -> Yield y (Right ys')

    go z s =
      case next_append s of
        Done       -> z
        Skip s'    -> go z s'
        Yield x s' -> go (z + x) s'
in go 0 (Left xs)
Here, next_stream and next_append are the stepper functions of the corresponding stream combinators, and go is the stream consumer of foldl_s.
While this loop is rather inefficient, it can be easily optimised using entirely standard techniques such as those described by Peyton Jones and Santos (1998). By inlining next_stream into the first branch of next_append, we get a nested case distinction:

next_append (Left xs) =
  case (case xs of
          []    -> Done
          x:xs' -> Yield x xs') of
    Done        -> Skip (Right ys)
    Skip xs'    -> Skip (Left xs')
    Yield x xs' -> Yield x (Left xs')
This term is easily improved by applying the case-of-case transformation, which pushes the outer case into the alternatives of the inner case:

next_append (Left xs) =
  case xs of
    []    -> case Done of
               Done        -> Skip (Right ys)
               Skip xs'    -> Skip (Left xs')
               Yield x xs' -> Yield x (Left xs')
    x:xs' -> case Yield x xs' of
               Done        -> Skip (Right ys)
               Skip xs'    -> Skip (Left xs')
               Yield x xs' -> Yield x (Left xs')
This code trivially rewrites to:

next_append (Left xs) = case xs of
  []    -> Skip (Right ys)
  x:xs' -> Yield x (Left xs')
The Right branch of next_append is simplified in a similar manner, resulting in:

next_append (Right ys) = case ys of
  []    -> Done
  y:ys' -> Yield y (Right ys')
Note how, by inlining, applying the case-of-case transformation and then simplifying, we have eliminated the construction (in next_stream) and inspection (in next_append) of one Step value per iteration. The good news is that these techniques are an integral part of GHC's optimiser and are applied to our programs automatically. Indeed, the optimiser then inlines next_append into the body of go and reapplies the transformations described above to produce:
let go z (Left xs) = case xs of
      []    -> go z (Right ys)
      x:xs' -> go (z + x) (Left xs')
    go z (Right ys) = case ys of
      []    -> z
      y:ys' -> go (z + y) (Right ys')
in go 0 (Left xs)
While this loop does not use any intermediate Step values, it still allocates Left and Right constructors to maintain the loop state. Eliminating these requires more sophisticated techniques than we have used so far. Fortunately, constructor specialisation (Peyton Jones 2007), an optimisation which has been implemented in GHC for some time, does precisely this. It analyses the shapes of the arguments in recursive calls to go and produces two specialised versions of the function, go_1 and go_2, which satisfy the following equivalences:

∀z xs. go z (Left xs)  = go_1 z xs
∀z ys. go z (Right ys) = go_2 z ys
The compiler then replaces calls to go by calls to a specialised version whenever possible. The definitions of the two specialisations are obtained by expanding go once in each of the above two equations and simplifying, which ultimately results in the following program:

let go_1 z xs = case xs of
      []    -> go_2 z ys
      x:xs' -> go_1 (z + x) xs'
    go_2 z ys = case ys of
      []    -> z
      y:ys' -> go_2 (z + y) ys'
in go_1 0 xs
Note that the original version of go is no longer needed. The loop has effectively been split into two parts, one for each of the two concatenated lists. Indeed, this result is the best we could have hoped for: not only have all intermediate data structures been eliminated, the loop has also been specialised for the algorithm at hand.
By now it should be obvious that, in order to compile stream programs to efficient code, all stream combinators must be inlined and subsequently specialised. Arguably, this is a weakness of our ap-
proach, as this sometimes results in excessive code duplication and
a significant increase in the size of the generated binary. However,
as discussed in Section 8, our experiments suggest that this increase
is almost always negligible.
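In practice this is arranged by marking every stream combinator for unconditional inlining. A minimal sketch of the pragmas the library presumably attaches to the Section 2 definitions (the exact pragma and phase control in the real implementation may differ):

{-# INLINE stream #-}
{-# INLINE unstream #-}
{-# INLINE map_s #-}
{-# INLINE foldl_s #-}
{-# INLINE append_s #-}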
7.2 Nested computations
So far, we have only considered the steps necessary to optimise
fused pipelines of flat operations on lists. For nested operations
such as concatMap, the story is more complicated. Nevertheless, it
is crucial that such operations are optimised well. Indeed, even our
introductory example uses concatMap under the hood as described
in Section 5.
Although a detailed explanation of how GHC derives the effi-
cient loop presented in the introduction would take up too much
space, we can investigate a less complex example which, neverthe-
less, demonstrates the basic principles underlying the simplification
of nested stream programs. In the following, we consider the simple
list comprehension sum [ m * m | m <- [1..n] ]. After desugaring and stream fusion, the term is transformed to (we omit the trivial guard):

foldl_s (+) 0 (concatMap_s (\m -> return_s (m * m))
               (enumFromTo_s 1 n))
After inlining the definitions of the stream functions, we arrive at the following loop (next_enum, next_concatMap and next_ret are the stepper functions of enumFromTo_s, concatMap_s and return_s, respectively, as defined in Figures 1 and 2):
let next_enum i | i > n     = Done
                | otherwise = Yield i (i + 1)

    next_concatMap (i, Nothing) =
      case next_enum i of
        Done       -> Done
        Skip i'    -> Skip (i', Nothing)
        Yield x i' -> let next_ret True  = Yield (x * x) False
                          next_ret False = Done
                      in Skip (i', Just (Stream next_ret True))
    next_concatMap (i, Just (Stream next s)) =
      case next s of
        Done       -> Skip (i, Nothing)
        Skip s'    -> Skip (i, Just (Stream next s'))
        Yield y s' -> Yield y (i, Just (Stream next s'))

    go z s = case next_concatMap s of
      Done       -> z
      Skip s'    -> go z s'
      Yield x s' -> go (z + x) s'
in go 0 (1, Nothing)
As before, we now inline next_enum and next_concatMap into the body of go and repeatedly apply the case-of-case transformation. Ultimately, this produces the following loop:
let go z (i, Nothing) | i > n     = z
                      | otherwise =
        let next_ret True  = Yield (i * i) False
            next_ret False = Done
        in go z (i + 1, Just (Stream next_ret True))
    go z (i, Just (Stream next s)) =
      case next s of
        Done       -> go z (i, Nothing)
        Skip s'    -> go z (i, Just (Stream next s'))
        Yield x s' -> go (z + x) (i, Just (Stream next s'))
in go 0 (1, Nothing)
Now we again employ constructor specialisation to split go into two mutually recursive functions go_1 and go_2 such that:

∀z i.        go z (i, Nothing)              = go_1 z i
∀z i next s. go z (i, Just (Stream next s)) = go_2 z i next s

The second specialisation is interesting in that it involves an existential component, the state of the stream. Thus, go_2 must have a polymorphic type which, however, is easily deduced by the compiler. After simplifying and rewriting calls to go, we arrive at the following code:
let go_1 z i | i > n     = z
             | otherwise =
        let next_ret True  = Yield (i * i) False
            next_ret False = Done
        in go_2 z (i + 1) next_ret True
    go_2 z i next s = case next s of
      Done       -> go_1 z i
      Skip s'    -> go_2 z i next s'
      Yield x s' -> go_2 (z + x) i next s'
in go_1 0 1
The loop has now been split into two mutually recursive functions. The first, go_1, computes the next element i of the enumeration [1..n] and then passes it to go_2, which computes the product and adds it to the accumulator z. However, the nested structure of the original loop obscures this simple algorithm. In particular, the stepper function next_ret of the stream produced by return_s has to be passed from go_1, where it is defined, to go_2, where it is used. If we are to produce efficient code, we must remove this indirection and allow next_ret to be inlined in the body of go_2. In the following, we consider two approaches to solving this problem: static argument transformation and specialisation on partial applications.
Static argument transformation   It is easy to see that next and i are static in the definition of go_2, i.e., they do not change between iterations. An optimising compiler can take advantage of this fact and eliminate the unnecessary arguments:

go_2 z i next s =
  let go_2' z s = case next s of
        Done       -> go_1 z i
        Skip s'    -> go_2' z s'
        Yield x s' -> go_2' (z + x) s'
  in go_2' z s
With this definition, go_2 can be inlined in the body of go_1. Subsequent simplification binds next to next_ret and allows the latter to be inlined in go_2':

go_1 z i | i > n     = z
         | otherwise =
    let go_2' z True  = go_2' (z + i * i) False
        go_2' z False = go_1 z (i + 1)
    in go_2' z True
The above can now be easily rewritten to the optimal loop:

go_1 z i | i > n     = z
         | otherwise = go_1 (z + i * i) (i + 1)
Note how the original nested loop has been transformed into a flat one. This is only possible because, in this particular example, the function argument of concatMap was not itself recursive. More complex nesting structures, in particular nested list comprehensions, are translated into nested loops if the static argument transformation is employed. For instance, our introductory example would be compiled to:

let go_1 z k | k > n     = z
             | otherwise =
        let go_2 z m | m > k     = go_1 z (k + 1)
                     | otherwise = go_2 (z + k * m) (m + 1)
        in go_2 z 1
in go_1 0 1
Specialisation   An alternative approach to optimising the program is to lift the definition of next_ret out of the body of go_1, according to the algorithm of Johnsson (1985):

next_ret i True  = Yield (i * i) False
next_ret i False = Done

go_1 z i | i > n     = z
         | otherwise = go_2 z (i + 1) (next_ret i) True
Now we can once more specialise go_2 for this call; but this time, in addition to constructors, we also specialise on the partial application of the now-free function next_ret, producing a go_3 such that:

∀z i j. go_2 z j (next_ret i) True = go_3 z j i
After expanding go_2 once in the above equation, we arrive at the following unoptimised definition of go_3:

go_3 z j i = case next_ret i True of
  Done       -> go_1 z j
  Skip s'    -> go_2 z j (next_ret i) s'
  Yield x s' -> go_2 (z + x) j (next_ret i) s'
Note that the stepper function is now statically known and can be inlined, which allows all case distinctions to be subsequently eliminated, leading to a quite simple definition:

go_3 z j i = go_2 (z + (i * i)) j (next_ret i) False
The above call gives rise to yet another specialisation of go_2:

∀z i j. go_2 z j (next_ret i) False = go_4 z j i
Again, we rewrite go_4 by inlining next_ret and simplifying, ultimately producing:

go_1 z i | i > n     = z
         | otherwise = go_3 z (i + 1) i
go_3 z j i = go_4 (z + (i * i)) j i
go_4 z j i = go_1 z j
This is trivially rewritten to exactly the same code as has been produced by the static argument transformation:

go_1 z i | i > n     = z
         | otherwise = go_1 (z + i * i) (i + 1)
This convergence of the two optimisation techniques is, however, only due to the simplicity of our example. For the more deeply nested program from the introduction, specialisation would produce two mutually recursive functions:

go_1 z k | k > n     = z
         | otherwise = go_2 z k (k + 1) 1
go_2 z k k' m | m > k     = go_1 z k'
              | otherwise = go_2 (z + k * m) k k' (m + 1)
This is essentially the code of f' from the introduction; the only difference is that GHC's optimiser has unrolled go_2 once and unboxed all loop variables. This demonstrates the differences between the two approaches nicely. The static argument transformation translates nested computations into nested recursive functions. Specialisation on partial applications, on the other hand, produces flat loops with several mutually recursive functions; the state of such a loop is maintained in explicit arguments.
Unfortunately, GHC currently supports neither of the two ap-
proaches — it only specialises on constructors but not on partial
applications of free functions and does not perform the static argu-
ment transformation. Although we have extended GHC’s optimiser
with both techniques, our implementation is quite fragile and does
not always produce the desired results. Indeed, missed optimisa-
tions are at least partly responsible for many of the performance
problems discussed in Section 8. At this point, the implementation
must be considered merely a proof of concept. We are, however,
hopeful that GHC will be able to robustly optimise stream-based
programs in the near future.
8. Results
We have implemented the entire Haskell standard List library, in-
cluding enumerations and list comprehensions, on top of our stream
fusion framework. Stream fusion is implemented via equational
transformations embedded as rewrite rules in the library source
code. We compare time, space, code size and fusion opportuni-
ties for programs in the nofib benchmark suite (Partain 1992),
when compared to the existing build/foldr system. To ensure a
fair comparison, both frameworks have been benchmarked with
our extensions to GHC’s optimiser (cf. Section 7) enabled. For the
build/foldr framework, these extensions do not significantly affect the running time and allocation behaviour, usually improving them slightly; without them, nested uses of concatMap under stream fusion risk not being optimised.
8.1 Time
Figure 4 presents the relative speedups for Haskell programs from
the nofib suite, when compared to the existing build/foldr system.
On average, there is a 3% improvement when using stream fusion,
with 6 of the test programs more than 15% faster, and one, the ‘in-
teger’ benchmark, more than 50% faster. One program, ‘paraffins’,
ran 24% slower, due to remnant Stream data constructors not stat-
ically removed by the compiler.
In general we can divide the results into three classes of pro-
grams:
1. those for which there is plenty of opportunity for fusion which
is under-exploited by build/foldr;
2. programs for which there is little fusion or for which the fusion
is in a form handled by build/foldr;
3. and thirdly, programs such as ‘paraffins’ with deeply nested
list computations and comprehensions which overtax our ex-
tensions to GHC’s optimiser.
For the first class of programs, those making critical use of left folds and zips, stream fusion can be a big win: 10% (and sometimes much more) improvement is not uncommon. This corresponds to around 15% of the programs tested.
In the second case, covering the majority of programs, there is either little available fusion, or the fusion is in the form of right folds and list comprehensions, already well handled by build/foldr. Only small improvements can be expected here.
Finally, the third class corresponds to some 5% of the programs tested. These programs have available fusion, but in deeply nested form, which can leave Step constructors behind (due to limitations in current GHC optimisations) rather than having them removed statically. Such programs currently run dramatically, and obviously, worse.
For large multi-module programs, the results are less clear, with
just as many programs speeding up as slowing down. We find that
for larger programs, GHC has a tendency to miss optimisation op-
portunities for stream fusible functions across module boundaries,
which is the subject of further investigation.
8.2 Space
Figure 5 presents the relative reduction in total heap allocations
for stream fusion programs compared to the existing build/foldr
system. The results can again be divided into the same three classes
as for the time benchmarks: those with under-exploited fusion
opportunities, those for which build/foldr already does a good job,
and those for which Step artifacts are not statically eliminated by
the compiler.
For programs in the first class, which fuse correctly and in which stream fusion finds new fusion opportunities, there can be dramatic reductions in allocations (up to 30%). Currently, this is the minority of programs. The majority of programs show modest reductions, with an average decrease in allocations of 4.4%.
have far worse allocation performance, however, due to missed op-
portunities to remove Step constructors in nested code. For large,
multi-module programs, we find a small increase in allocations, for
similar reasons as for the time benchmarks.
8.3 Fusion opportunities
In Figure 6 we compare the number of fusion sites identified with
stream fusion, compared to that with build/foldr. In the majority of
cases, more fusion sites are identified, corresponding to new fusion
opportunities with zips and left folds. Similar results are seen when
compiling GHC itself, with around 1500 build/foldr fusion sites
identified by the compiler, and more than 3000 found under stream
fusion.
8.4 Code size
Total compiled binary size was measured, and we find that for
single module programs, code size increases by a negligible 2.5%
on average. For multi-module programs, code size increases by
11%. 5% of programs increased by more than 25% in size, again
due to unremoved specialised functions and Step constructors.
9. Further work
9.1 Improved optimisations
The main direction for future work on stream fusion is to improve
further the compiler optimisations required to remove Step constructors
statically, as described in Section 7.
Another possible approach to reliably fusing nested uses of
concatMap is to define a restricted version of it which assumes
that the inner stream is constructed in a uniform way, i.e., using the
same stepper function and the same function to construct initial
inner-stream states in every iteration of the outer stream. This
situation corresponds closely to the forms that we expect to be able
to optimise with the static argument transformation.
The aim would be to have a rule that matches the common situ-
ation where this restricted concatMap can be used. Unfortunately
such rules cannot be expressed in the current GHC rules language.
A more powerful rule matcher would allow us to write something
like:
concatMap (\x -> unstream (Stream next⟦x⟧ s⟦x⟧))
  = concatMap' (\y -> next⟦y⟧) (\y -> s⟦y⟧)
[Figure 4 is a bar chart over the nofib benchmark programs; the y-axis is percentage speedup.]
Figure 4: Percentage improvement in running time compared to build/foldr fusion
[Figure 5 is a bar chart over the nofib benchmark programs; the y-axis is percentage reduction in allocations.]
Figure 5: Percent reduction in allocations compared to build/foldr fusion
[Figure 6 is a bar chart over the nofib benchmark programs; the y-axis is the improvement in found fusion sites.]
Figure 6: New fusion opportunities found when compared to build/foldr
Here, T⟦x⟧ matches a term T abstracted over free occurrences of the variable x. On the right-hand side, the same syntax indicates substitution.
The key point is that the stepper function of the inner stream is statically known in concatMap'. This is a much more favourable situation compared to embedding the entire inner stream in the seed, as concatMap_s does. Indeed, extending GHC's rule matching capabilities in this direction might be easier than robustly implementing the optimisations outlined in Section 7.
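Under those assumptions, one possible shape for the restricted combinator is sketched below (concatMap_s' and its parameter names are hypothetical, not an implemented API); the crucial point is that the inner stepper next_inner is passed as a static argument instead of being packed into a Stream in the seed:

concatMap_s' :: (a -> s -> Step b s)   -- uniform inner stepper
             -> (a -> s)               -- initial inner state per outer element
             -> Stream a -> Stream b
concatMap_s' next_inner from (Stream next_a sa0) = Stream next (sa0, Nothing)
  where
    next (sa, Nothing) = case next_a sa of
      Done        -> Done
      Skip sa'    -> Skip (sa', Nothing)
      Yield a sa' -> Skip (sa', Just (a, from a))
    -- the outer element a is kept in the state so that the statically
    -- known next_inner can be applied to it at every inner step
    next (sa, Just (a, s)) = case next_inner a s of
      Done       -> Skip (sa, Nothing)
      Skip s'    -> Skip (sa, Just (a, s'))
      Yield b s' -> Yield b (sa, Just (a, s'))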
9.2 Fusing general recursive definitions
Writing stream stepper functions is not always easy. The represen-
tation of control flow as state makes them appear somewhat inside
out. One technique we found useful when translating Haskell list
functions into stream fusible versions was to first transform the list
version to very low level Haskell. From this form, where the pre-
cise control flow is clear, there is a fairly direct translation into a
stream version.
For example, here is a list function written in a very low-level style using three mutually tail-recursive functions. Each one has only simple patterns on the left-hand side and, on the right-hand side, an empty list, a call, or a cons and a call.
intersperse :: a -> [a] -> [a]
intersperse sep xs0 = init xs0
  where
    init xs = case xs of
      []     -> []
      (x:xs) -> go x xs

    go x xs = x : to xs

    to xs = case xs of
      []     -> []
      (x:xs) -> sep : go x xs
We can translate this to an equivalent function on streams by making a data type with one constructor per function. Each constructor holds the arguments to that function, except that arguments of list type are replaced by the stream state type. In the body of each function, case analysis on lists is replaced by calling next0 on the stream state. Consing an element onto the result is replaced by a use of Yield; calls are replaced by Skips with the appropriate state data constructor:
data State a s = Init s | Go a s | To s

intersperse_s :: a -> Stream a -> Stream a
intersperse_s sep (Stream next0 s0) = Stream next (Init s0)
  where
    next (Init s) = case next0 s of
      Done       -> Done
      Skip s'    -> Skip (Init s')
      Yield x s' -> Skip (Go x s')
    next (Go x s) = Yield x (To s)
    next (To s) = case next0 s of
      Done       -> Done
      Skip s'    -> Skip (To s')
      Yield x s' -> Yield sep (Go x s')
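Wrapping the stream version back up as a list function lets us sanity-check the translation (intersperse' is an illustrative name):

intersperse' :: a -> [a] -> [a]
intersperse' sep = unstream . intersperse_s sep . stream

-- intersperse' ',' "abc" == "a,b,c"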
It would be interesting to investigate the precise restrictions on
the form which can be translated in this way and whether it can
be automated. This might provide a practical way to fuse general
recursive definitions over lists: by checking if the list function can
be translated to the restricted form and then translating into a stream
version. There is some precedent for this approach: Launchbury
and Sheard (1995) show that in many common cases it is possible to
transform general recursive definitions on lists into a form suitable
for use with ordinary build/foldr short-cut fusion.
9.3 Fusing more general algebraic data types
It seems straightforward to define a co-structure for any sum-of-
products data structure. Consider for example a binary tree type
with information in both the leaves and interior nodes:
data Tree a b = Leaf a | Fork b (Tree a b) (Tree a b)
The corresponding co-structure would be:
data Stream a b = forall s. Stream (s -> Step a b s) s
data Step a b s = Leaf_s a | Fork_s b s s | Skip s
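To make the duality concrete, conversion functions in the style of stream and unstream might look as follows; this is a minimal sketch using the Tree and co-structure definitions just given (streamTree and unstreamTree are illustrative names):

streamTree :: Tree a b -> Stream a b
streamTree t0 = Stream next t0
  where
    -- the tree itself serves as the stream state
    next (Leaf a)     = Leaf_s a
    next (Fork b l r) = Fork_s b l r

unstreamTree :: Stream a b -> Tree a b
unstreamTree (Stream next s0) = go s0
  where
    go s = case next s of
      Leaf_s a     -> Leaf a
      Fork_s b l r -> Fork b (go l) (go r)
      Skip s'      -> go s'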
Of course other short-cut fusion systems can also be generalised
in this way but in practice they are not because it requires defining a
new infrastructure for each new data structure that we wish to fuse.
Automation would be required to make this practical. This problem
is somewhat dependent on the ability to generate stream style code
from ordinary recursive definitions.
10. Conclusion
It is possible, via stream fusion, to automatically fuse a complete
range of list functions, beyond that of previous short-cut fusion
techniques. In particular, it is possible to fuse left and right folds,
zips, concats and nested lists, including list comprehensions. For
the first time, details are provided for the range of general purpose
optimisations required to generate efficient code for unfoldr-based
short-cut fusion.
Stream fusion is certainly practical, with a full-scale implementation for Haskell lists, using library-based rewrite rules for fusion. Our results indicate that there is greater opportunity for fusion than under the existing build/foldr system, and they also show moderate improvements in space and time performance. Further improvements depend on strengthening the specific compiler optimisations required to remove fusion artifacts statically.
The source code for the stream fusion List library, and modified
standard Haskell library and compiler, are available online.[2]
Acknowledgments We are indebted to Simon Peyton Jones for
clarifying and extending GHC’s optimiser, which greatly assisted
this effort. We are also grateful to Manuel Chakravarty and Spencer
Janssen for feedback on drafts and to the four anonymous referees
for their invaluable comments.
References

Olaf Chitil. Type inference builds a short cut to deforestation. In ICFP '99: Proceedings of the Fourth ACM SIGPLAN International Conference on Functional Programming, pages 249–260, New York, NY, USA, 1999. ACM Press.

Olaf Chitil. Promoting non-strict programming. In Draft Proceedings of the 18th International Symposium on Implementation and Application of Functional Languages, IFL 2006, pages 512–516, Budapest, Hungary, September 2006. Eötvös Loránd University.

Duncan Coutts, Don Stewart, and Roman Leshchinskiy. Rewriting Haskell strings. In Practical Aspects of Declarative Languages, 8th International Symposium, PADL 2007, pages 50–64. Springer-Verlag, January 2007.

Nils Anders Danielsson and Patrik Jansson. Chasing bottoms, a case study in program verification in the presence of partial and infinite values. In Dexter Kozen, editor, Proceedings of the 7th International Conference on Mathematics of Program Construction, MPC 2004, volume 3125 of LNCS, pages 85–109. Springer-Verlag, July 2004.

Jeremy Gibbons. Streaming representation-changers. In D. Kozen, editor, Mathematics of Program Construction, volume 3125 of LNCS, pages 142–168. Springer-Verlag, 2004.

Jeremy Gibbons and Geraint Jones. The under-appreciated unfold. In ICFP '98: Proceedings of the Third ACM SIGPLAN International Conference on Functional Programming, pages 273–279, New York, NY, USA, 1998. ACM Press.

Andrew Gill. Cheap Deforestation for Non-strict Functional Languages. PhD thesis, University of Glasgow, January 1996.

Andrew Gill, John Launchbury, and Simon Peyton Jones. A short cut to deforestation. In Conference on Functional Programming Languages and Computer Architecture, pages 223–232, June 1993.

Zhenjiang Hu, Hideya Iwasaki, and Masato Takeichi. Deriving structural hylomorphisms from recursive definitions. In Proceedings of the 1st ACM SIGPLAN International Conference on Functional Programming, volume 31(6), pages 73–82. ACM Press, New York, 1996.

Patricia Johann. Short cut fusion: Proved and improved. In SAIG 2001: Proceedings of the Second International Workshop on Semantics, Applications, and Implementation of Program Generation, pages 47–71, London, UK, 2001. Springer-Verlag.

Thomas Johnsson. Lambda lifting: transforming programs to recursive equations. In Functional Programming Languages and Computer Architecture: Proceedings of a Conference (Nancy, France, September 1985), New York, NY, USA, 1985. Springer-Verlag.

John Launchbury and Ross Paterson. Parametricity and unboxing with unpointed types. In European Symposium on Programming, pages 204–218, 1996.

John Launchbury and Tim Sheard. Warm fusion: deriving build-catas from recursive definitions. In FPCA '95: Proceedings of the Seventh International Conference on Functional Programming Languages and Computer Architecture, pages 314–323, New York, NY, USA, 1995. ACM Press.

Erik Meijer, Maarten Fokkinga, and Ross Paterson. Functional programming with bananas, lenses, envelopes and barbed wire. In J. Hughes, editor, Proceedings of the 5th ACM Conference on Functional Programming Languages and Computer Architecture, FPCA '91, Cambridge, MA, USA, August 1991, volume 523 of LNCS, pages 124–144. Springer-Verlag, Berlin, 1991.

Will Partain. The nofib benchmark suite of Haskell programs. In Functional Programming, pages 195–202, 1992.

Simon Peyton Jones. Constructor specialisation for Haskell programs, 2007. Submitted for publication.

Simon Peyton Jones and André L. M. Santos. A transformation-based optimiser for Haskell. Science of Computer Programming, 32(1–3):3–47, 1998. ISSN 0167-6423.

Simon Peyton Jones, Andrew Tolmach, and Tony Hoare. Playing by the rules: rewriting as a practical optimisation technique in GHC. In Ralf Hinze, editor, 2001 Haskell Workshop. ACM SIGPLAN, September 2001.

Simon Peyton Jones et al. The Haskell 98 language and libraries: The revised report. Journal of Functional Programming, 13(1):0–255, January 2003.

Colin Runciman. SmallCheck 0.2: another lightweight testing library in Haskell. http://article.gmane.org/gmane.comp.lang.haskell.general/14461, 2006.

Josef Svenningsson. Shortcut fusion for accumulating parameters & zip-like functions. In ICFP '02: Proceedings of the Seventh ACM SIGPLAN International Conference on Functional Programming, pages 124–132, New York, NY, USA, 2002. ACM Press.

Akihiko Takano and Erik Meijer. Shortcut deforestation in calculational form. In Conference Record of the 7th ACM SIGPLAN/SIGARCH International Conference on Functional Programming Languages and Computer Architecture, FPCA '95, pages 306–313. ACM Press, New York, 1995.

The GHC Team. The Glasgow Haskell Compiler (GHC). http://haskell.org/ghc, 2007.

Philip Wadler. List comprehensions. In Simon Peyton Jones, editor, The Implementation of Functional Programming Languages, chapter 15. Prentice Hall, 1987.

Philip Wadler. Deforestation: transforming programs to eliminate trees. Theoretical Computer Science, 73(2):231–248, 1990. Special issue of selected papers from the 2nd European Symposium on Programming.