SPPF-Style Parsing From Earley Recognisers
Elizabeth Scott
Department of Computer Science
Royal Holloway, University of London
Egham, Surrey, United Kingdom
e.scott@rhul.ac.uk
Abstract
In its recogniser form, Earley’s algorithm for testing whether a string can be derived from a grammar is
worst case cubic on general context free grammars (CFG). Earley gave an outline of a method for turning
his recognisers into parsers, but it turns out that this method is incorrect. Tomita’s GLR parser returns
a shared packed parse forest (SPPF) representation of all derivations of a given string from a given CFG
but is worst case unbounded polynomial order. We have given a modified worst-case cubic version, the
BRNGLR algorithm, that, for any string and any CFG, returns a binarised SPPF representation of all
possible derivations of a given string. In this paper we apply similar techniques to develop two versions of
an Earley parsing algorithm that, in worst-case cubic time, return an SPPF representation of all derivations
of a given string from a given CFG.
Keywords: Earley parsing, cubic generalised parsing, context free languages
Since Knuth's seminal 1960s work on LR parsing [14] was extended to LALR
parsers by DeRemer [5,4], the Computer Science community has been able to auto-
matically generate parsers for a very wide class of context free languages. However,
many parsers are still written manually, either using tool support or even completely
by hand. This is partly because in some application areas such as natural language
processing and bioinformatics we do not have the luxury of designing the language
so that it is amenable to known parsing techniques, but also it is clear that left to
themselves computer language designers do not naturally write LR(1) grammars.
A grammar not only defines the syntax of a language, it is also the starting point
for the definition of the semantics, and the grammar which facilitates semantic def-
inition is not usually the one which is LR(1). This is illustrated by the development
of the Java Standard. The first edition of the Java Language Specification [7] con-
tains a detailed discussion of the need to modify the grammar used to define the
syntax and semantics in the main part of the standard to make it LALR(1) for
compiler generation purposes. In the third version of the standard [8] the compiler
version of the grammar is written in EBNF and is (unnecessarily) ambiguous, illustrating
the difficulty of making correct transformations. Given this difficulty in
constructing natural LR(1) grammars that support desired semantics, the general
parsing techniques, such as the CYK [20], Earley [6] and GLR [19] algorithms, de-
veloped for natural language processing are also of interest to the wider computer
science community.
When using grammars as the starting point for semantics definition, we distin-
guish between recognisers which simply determine whether or not a given string is in
the language defined by a given grammar, and parsers which also return some form
of derivation of the string, if one exists. In their basic forms the CYK and Earley
algorithms are recognisers while GLR-style algorithms are designed with derivation
tree construction, and hence parsing, in mind. However, in both recogniser and
parser form, Tomita’s GLR algorithm is of unbounded polynomial order in worst
case. In this paper we describe the expansion of Earley recognisers to parsers which
are of worst case cubic order.
1 Generalised parsing techniques
There is no known linear time parsing or recognition algorithm that can be used
with all context free grammars. In their recogniser forms the CYK algorithm is
worst case cubic on grammars in Chomsky normal form and Earley’s algorithm
is worst case cubic on general context free grammars and worst case order n^2 on
non-ambiguous grammars. General recognisers must, by definition, be applicable
to ambiguous grammars. Expanding general recognisers to parsers raises several
problems, not least because there can be exponentially many or even infinitely
many derivations for a given input string. A cubic recogniser which was modified
to simply return all derivations could become an unbounded parser.
Of course, it can be argued that ambiguous grammars reflect ambiguous se-
mantics and thus should not be used in practice. This would be far too extreme
a position to take. For example, it is well known that the if-else statement in the
ANSI-standard grammar for C is ambiguous, but a longest match resolution results
in linear time parsers that attach the ‘else’ to the most recent ‘if’, as specified by
the ANSI-C semantics. The ambiguous ANSI-C grammar is certainly practical for
parser implementation. However, in general ambiguity is not so easily handled, and
it is well known that grammar ambiguity is in fact undecidable [11], so we cannot
expect a parser generator simply to check for ambiguity in the grammar and report
the problem back to the user.
It is possible that many of the ad hoc methods of dealing with specific ambigu-
ity, such as the longest match approach for if-else, can be generalised into standard
classes of typical ambiguity which can be automatically tested for (see, for example,
[3]), but this remains a topic requiring further research.
Another possibility is to avoid the issue by just returning one derivation. In [9]
there is an algorithm for generating a rightmost derivation from the output of an
Earley recogniser in at worst cubic time. However, if only one derivation is returned
then this creates problems for a user who wants all derivations and, even in the case
where only one derivation is required, there is the issue of ensuring that it is the
required derivation that is returned. Furthermore, naïve users may not even be
aware that there was more than one possible derivation.
A truly general parser will return all possible derivations in some form. Perhaps
the most well known representation is the shared packed parse forest (SPPF) de-
scribed and used by Tomita [19]. Using this approach we can at least tell whether
there is more than one derivation of a given string in a given grammar: use a GLR
parser to build an SPPF and then test to see if the SPPF contains any packed
nodes. Tomita’s description of the representation does not allow for the infinitely
many derivations which arise from grammars which contain cycles but it is relatively
simple to modify his formulation to include these, and a fully general SPPF con-
struction, based on Farshi’s version [15] of Tomita’s GLR algorithm, was given by
Rekers [16]. These algorithms are all worst-case unbounded polynomial order and,
in fact, Johnson [12] has shown that Tomita-style SPPFs are worst case unbounded
polynomial size. Thus using such structures will also turn any cubic recognition
technique into a worst case unbounded polynomial parsing technique.
Leaving aside the potential increase in complexity when turning a recogniser into
a parser, it is clear that this process is often difficult to carry out correctly. Earley
gave an algorithm for constructing derivations of a string accepted by his recogniser,
but this was subsequently shown by Tomita [19] to return spurious derivations in
certain cases.
Tomita’s original version of his algorithm failed to terminate on grammars with
hidden left recursion and, as remarked above, had no mechanism for constructing
complete shared packed parse forests for grammars with cycles.
In [2] there is given an outline of an algorithm to turn the recogniser reported
there and in [1] into a parser, but again, as written, this algorithm will generate
spurious derivations as well as the correct ones. The recogniser described in [1]
is not applicable to grammars with hidden left recursion but the closely related
RIGLR algorithm [18] is fully general, and as a recogniser is of worst case cubic
order. There is a parser version which correctly constructs SPPFs but as these are
Tomita-style SPPFs the parser is of unbounded polynomial order.
As we have mentioned, Tomita’s GLR algorithm was designed with parse tree
construction in mind. We have given a GLR algorithm, BRNGLR [17], which is
worst case cubic order and, because the tree building is integral to the algorithm,
the parser, which builds a modified form of SPPF, is also worst case cubic order. In
this paper we apply similar techniques to the Earley recogniser and construct two
versions of a complete Earley parser, both of which are worst case cubic order. Thus
we have an Earley parser which produces an SPPF representation of all derivations
of a given input string in worst case cubic space and time.
2 Background theory
In this section we give a brief description of Earley’s algorithm, for simplicity with-
out lookahead, and show how Earley’s own extension of this to a parser can fail.
We then show how to apply the techniques developed in [17] to correctly generate
a representation of all possible derivations of a given input string from Earley’s
recogniser in worst case cubic time and space.
A context free grammar (CFG) consists of a set N of non-terminal symbols, a
set T of terminal symbols, an element S ∈ N called the start symbol, and a set P
of numbered grammar rules of the form A ::= α where A ∈ N and α is a (possibly
empty) string of terminals and non-terminals. The symbol ε denotes the empty
string.
A derivation step is an element of the form γAβ ⇒ γαβ where γ and β are strings
of terminals and non-terminals and A ::= α is a grammar rule. A derivation of τ
from σ is a sequence of derivation steps σ ⇒ β_1 ⇒ ... ⇒ β_{n−1} ⇒ τ. We may also write
σ ⇒* τ or σ ⇒^n τ in this case.
A sentential form is any string α such that S ⇒* α, and a sentence is a sentential
form which contains only elements of T. The set, L(Γ), of sentences which can
be derived from the start symbol of a grammar Γ, is defined to be the language
generated by Γ.
A derivation tree is an ordered tree whose root is labelled with the start symbol,
leaf nodes are labelled with a terminal or ε, and interior nodes are labelled with a
non-terminal, A say, and have a sequence of children corresponding to the symbols
on the right hand side of a rule for A.
A shared packed parse forest (SPPF) is a representation designed to reduce the
space required to represent multiple derivation trees for an ambiguous sentence. In
an SPPF, nodes which have the same tree below them are shared and nodes which
correspond to different derivations of the same substring from the same non-terminal
are combined by creating a packed node for each family of children. Examples are
given in Sections 3 and 4. Nodes can be packed only if their yields correspond to the
same portion of the input string. Thus, to make it easier to determine whether two
alternates can be packed under a given node, SPPF nodes are labelled with a triple
(x, j, i) where a_{j+1} ... a_i is a substring matched by x. To obtain a cubic algorithm
we use binarised SPPFs which contain intermediate additional nodes but which are
of worst case cubic size. (The SPPF is said to be binarised because the additional
nodes ensure that nodes whose children are not packed nodes have out-degree at
most two.)
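To make the sharing and packing concrete, here is a minimal sketch, in Python, of one possible representation of such a binarised SPPF. The class and method names are our own illustrative assumptions, not taken from the paper.

# Minimal sketch (ours, illustrative; names are assumptions) of a binarised SPPF.
# A symbol node is labelled (x, j, i); an intermediate node is labelled by a dotted
# rule together with the extents j and i. Each node holds a set of families of
# children; a node with more than one family is ambiguous, and each family would
# be drawn as a packed node in a picture of the forest.

class SPPFNode:
    def __init__(self, label):
        self.label = label        # e.g. ('S', 0, 2) or ('S ::= S . T', 0, 1)
        self.families = set()     # set of tuples of child labels, each of length <= 2

    def add_family(self, children):
        self.families.add(tuple(children))

class SPPF:
    def __init__(self):
        self.nodes = {}           # label -> SPPFNode, so each label is created once (sharing)

    def node(self, label):
        if label not in self.nodes:
            self.nodes[label] = SPPFNode(label)
        return self.nodes[label]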
Earley's recognition algorithm constructs, for each position i in the input string
a_1 ... a_n, a set of items. Each item represents a position in the grammar that a top
down parser could be in after matching a_1 ... a_i. In detail, the set E_0 is initially
set to be the items (S ::= ·α, 0). For i > 0, E_i is initially set to be the items
(A ::= αa_i · β, j) such that (A ::= α · a_iβ, j) ∈ E_{i−1}. The sets E_i are constructed in
order and 'completed' by adding items as follows: for each item (B ::= γ · Dδ, k) ∈ E_i
and each grammar rule D ::= ρ, (D ::= ·ρ, i) is added to E_i, and for each item
(B ::= ν·, k) ∈ E_i, if (D ::= τ · Bμ, h) ∈ E_k then (D ::= τB · μ, h) is added to E_i.
The input string is in the language of the grammar if and only if there is an item
(S ::= α·, 0) ∈ E_n.
As an example consider the grammar

S ::= ST | a    B ::= ε    T ::= aB | a

and input string aa. The Earley sets are

E_0 = {(S ::= ·ST, 0), (S ::= ·a, 0)}
E_1 = {(S ::= a·, 0), (S ::= S · T, 0), (T ::= ·aB, 1), (T ::= ·a, 1)}
E_2 = {(T ::= a · B, 1), (T ::= a·, 1), (B ::= ·, 2), (S ::= ST·, 0),
       (T ::= aB·, 1), (S ::= S · T, 0), (T ::= ·aB, 2), (T ::= ·a, 2)}
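For readers who want to reproduce these sets, the following is a small sketch of the recogniser just described (no lookahead), written in Python. The grammar encoding and the function name are our own assumptions, not part of the paper.

# Minimal Earley recogniser sketch (no lookahead). Items are (lhs, rhs, dot, start).
def earley_recognise(grammar, start, tokens):
    n = len(tokens)
    E = [set() for _ in range(n + 1)]
    E[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(n + 1):
        changed = True
        while changed:                                   # 'complete' E_i
            changed = False
            for (A, rhs, dot, j) in list(E[i]):
                if dot < len(rhs) and rhs[dot] in grammar:        # predictor
                    for delta in grammar[rhs[dot]]:
                        item = (rhs[dot], delta, 0, i)
                        if item not in E[i]:
                            E[i].add(item); changed = True
                elif dot == len(rhs):                              # completer
                    for (B, rhs2, dot2, k) in list(E[j]):
                        if dot2 < len(rhs2) and rhs2[dot2] == A:
                            item = (B, rhs2, dot2 + 1, k)
                            if item not in E[i]:
                                E[i].add(item); changed = True
        if i < n:                                                  # scanner initialises E_{i+1}
            for (A, rhs, dot, j) in E[i]:
                if dot < len(rhs) and rhs[dot] == tokens[i]:
                    E[i + 1].add((A, rhs, dot + 1, j))
    return any(A == start and dot == len(rhs) and j == 0
               for (A, rhs, dot, j) in E[n])

# The example grammar above: S ::= ST | a,  B ::= epsilon,  T ::= aB | a
grammar = {'S': [('S', 'T'), ('a',)], 'B': [()], 'T': [('a', 'B'), ('a',)]}
print(earley_recognise(grammar, 'S', ['a', 'a']))   # True: aa is in the language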
3 Problems with Earley parser construction
Earley’s original paper gives a brief description of how to construct a representation
of all possible derivation trees from the recognition algorithm, and claims that this
requires at most cubic time and space. The proposal is to maintain pointers from
the non-terminal instances on the right hand sides of a rule in an item to the item
that 'generated' that item. So, if (D ::= τ · Bμ, h) ∈ E_k and (B ::= δ·, k) ∈ E_i
then a pointer is assigned from the instance of B on the left of the dot in (D ::=
τB · μ, h) ∈ E_i to the item (B ::= δ·, k) ∈ E_i. In order to keep the size of the sets
E_i in the parser version of the algorithm the same as the size in the recogniser we
add pointers from the instance of B in (D ::= τB · μ, h) to each of the items of the
form (B ::= δ′·, k′) in E_i.
Example 1
Applying this approach to the grammar from the previous section, and the string
aa, gives the following structure.
[Diagram: the Earley sets
E_0 = {(S ::= ·ST, 0), (S ::= ·a, 0)},
E_1 = {(S ::= a·, 0), (S ::= S · T, 0), (T ::= ·aB, 1), (T ::= ·a, 1)} and
E_2 = {(T ::= a · B, 1), (T ::= a·, 1), (B ::= ·, 2), (S ::= ST·, 0), (T ::= aB·, 1), (S ::= S · T, 0), (T ::= ·aB, 2), (T ::= ·a, 2)},
with pointers from the non-terminal instances to the items that generated them.]
From this structure the SPPF below can be constructed, as follows.
[Diagram: the SPPF with nodes u_0 = (S, 0, 2), u_1 = (S, 0, 1), u_2 = (a, 0, 1),
u_3 = (T, 1, 2), u_4 = (a, 1, 2) and u_5 = (B, 2, 2), the last with an ε child;
u_3 has two packed families of children.]
We start with (S ::= ST·, 0) in E_2. Since the integer in the item is 0 and it lies
in the level 2 Earley set, we create a node, u_0, labelled (S, 0, 2). The pointer from S
points to (S ::= a·, 0) in E_1, so we create a child node, u_1, labelled (S, 0, 1). From
u_1 we create a child node, u_2, labelled (a, 0, 1). Returning to u_0, there is a pointer
from T that points to (T ::= aB·, 1) in E_2, so we create a child node, u_3, labelled
(T, 1, 2). From u_3 we create a child node u_4 labelled (a, 1, 2) and, using the pointer
from B, a child node, u_5, labelled (B, 2, 2), which in turn has a child labelled ε. There
is another pointer from T that points to (T ::= a·, 1) in E_2. We already have an
SPPF node, u_3, labelled (T, 1, 2), so we reuse this node. We also have a node, u_4,
labelled (a, 1, 2). However, u_3 does not have a family of children consisting of the
single element u_4, so we pack its existing family of children under a new packed
node and create a further packed node with child u_4.
The procedure proposed by Earley works correctly for the above example, but
adding multiple pointers to a given instance of a non-terminal can create errors. As
remarked in [19] p74, if we consider the grammar
S ::= SS | b
and the input string bbb we find that the above procedure generates the correct
derivations of bbb but also spurious derivations of the strings bbbb and bb. The
problem is that the derivation of bb from the left-most S in one derivation of bbb
becomes intertwined with the derivation of bb from the rightmost S in the other
derivation, resulting in the creation of bbbb.
We could avoid this problem by creating separate instances of the items for
different substring matches, so if (B ::= δ·, k), (B ::= σ·, k′) ∈ E_i where k ≠ k′ then
we create two copies of (D ::= τB · μ, h), one pointing to each of the two items. In
the above example we would create two items (S ::= SS·, 0) in E_3, one in which
the second S points to (S ::= b·, 2) and the other in which the second S points to
(S ::= SS·, 1). This would cause correct derivations to be generated, but it also
effectively embeds all the derivation trees in the construction and, as reported by
Johnson, the size cannot be bounded by O(n^p) for any fixed integer p.
For example, using such a method for input b^n to the grammar

S ::= SSS | SS | b

the set E_i constructed by the parser will contain Ω(i^3) items and hence the complete
structure contains Ω(n^4) elements. Thus this version of Earley's method does
not result in a cubic parser. To see this note first that, when constructed by the
recogniser, the Earley set E_i is the union of the sets

U_0 = {(S ::= b·, i − 1), (S ::= ·SSS, i), (S ::= ·SS, i), (S ::= ·b, i)}
U_1 = {(S ::= S · SS, k) | i − 1 ≥ k ≥ 0}
U_2 = {(S ::= S · S, k) | i − 1 ≥ k ≥ 0}
U_3 = {(S ::= SS·, k) | i − 1 ≥ k ≥ 0}
U_4 = {(S ::= SS · S, k) | i − 2 ≥ k ≥ 0}
U_5 = {(S ::= SSS·, k) | i − 3 ≥ k ≥ 0}.
If we add pointers then, since there are i elements (S ::= SS·, q) in E_i, 0 ≤ q ≤
(i−1), and (S ::= ·SSS, q) ∈ E_q, we will add i elements of the form (S ::= S · SS, q)
to E_i. Then E_q will have q elements of the form (S ::= S · SS, p), 0 ≤ p ≤ (q − 1),
so we will add i(i − 1)/2 elements of the form (S ::= SS · S, r) to E_i, 0 ≤ r ≤ (i − 1).
Finally, E_q will have q(q − 1)/2 elements of the form (S ::= SS · S, p), 0 ≤ p ≤ (q − 1),
so we will add i(i − 1)(i − 2)/6 elements of the form (S ::= SSS·, r) to E_i.
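As a quick arithmetic check of these counts (our own illustration, not part of the paper), the number of copies of each form grows as follows.

# Quick check (ours) of the growth of the copied items in E_i for S ::= SSS | SS | b
# under the item-copying scheme above.
def copies(i):
    c1 = i                                            # copies of (S ::= S . SS, q)
    c2 = i * (i - 1) // 2                             # copies of (S ::= SS . S, r)
    c3 = sum(q * (q - 1) // 2 for q in range(i))      # copies of (S ::= SSS ., r), i.e. i(i-1)(i-2)/6
    return c1, c2, c3

for i in (10, 20, 40):
    print(i, copies(i), sum(copies(i)))   # the third component, and the total, grow cubically in i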
Grune [10] has described a parser which exploits an Unger style parser to con-
struct the derivations of a string from the sets produced by Earley’s recogniser.
However, as noted by Grune, in the case where the number of derivations is expo-
nential the resulting parser will be of at least unbounded polynomial order in worst
case.
4 A cubic parser which walks the Earley sets
We can turn Earley’s algorithm into a correct parser by adding pointers between
items rather than instances of non-terminals, and labelling the pointers in a way
which allows a binarised SPPF to be constructed by walking the resulting structure.
(In the next section we shall give a version of the algorithm that constructs a
binarised SPPF as the Earley sets are constructed.)
Set E_0 to be the items (S ::= ·α, 0). For i > 0 initialise E_i by adding the item
p = (A ::= αa_i · β, j) for each q = (A ::= α · a_iβ, j) ∈ E_{i−1} and, if α ≠ ε, creating a
predecessor pointer labelled i − 1 from p to q. Before initialising E_{i+1} complete E_i
as follows. For each item (B ::= γ · Dδ, k) ∈ E_i and each rule D ::= ρ, (D ::= ·ρ, i)
is added to E_i. For each item t = (B ::= ν·, k) ∈ E_i and each corresponding item
q = (D ::= τ · Bμ, h) ∈ E_k, if there is no item p = (D ::= τB · μ, h) ∈ E_i create one.
Add a reduction pointer labelled k from p to t and, if τ ≠ ε, a predecessor pointer
labelled k from p to q.
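A sketch of this construction, in Python, is given below; it extends a plain recogniser with the two kinds of labelled pointers just described. The data layout and names are our own assumptions, not the paper's.

# Sketch (ours, illustrative) of the Earley sets decorated with labelled pointers.
# Items are (lhs, rhs, dot, start); pred and redn map an item to a set of
# (label, target item) pairs, the label being an Earley-set index as above.
from collections import defaultdict

def decorated_earley_sets(grammar, start, tokens):
    n = len(tokens)
    E = [set() for _ in range(n + 1)]
    pred = defaultdict(set)    # predecessor pointers
    redn = defaultdict(set)    # reduction pointers
    E[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(n + 1):
        changed = True
        while changed:
            changed = False
            for (A, rhs, dot, j) in list(E[i]):
                if dot < len(rhs) and rhs[dot] in grammar:            # predict
                    for delta in grammar[rhs[dot]]:
                        item = (rhs[dot], delta, 0, i)
                        if item not in E[i]:
                            E[i].add(item); changed = True
                elif dot == len(rhs):                                  # complete
                    t = (A, rhs, dot, j)
                    for q in list(E[j]):
                        (B, rhs2, dot2, h) = q
                        if dot2 < len(rhs2) and rhs2[dot2] == A:
                            p = (B, rhs2, dot2 + 1, h)
                            if p not in E[i]:
                                E[i].add(p); changed = True
                            redn[p].add((j, t))                        # reduction pointer labelled j
                            if dot2 > 0:
                                pred[p].add((j, q))                    # predecessor pointer labelled j
        if i < n:                                                      # scan into E_{i+1}
            for q in E[i]:
                (A, rhs, dot, j) = q
                if dot < len(rhs) and rhs[dot] == tokens[i]:
                    p = (A, rhs, dot + 1, j)
                    E[i + 1].add(p)
                    if dot > 0:
                        pred[p].add((i, q))                            # predecessor pointer labelled i
    return E, pred, redn

# e.g. E, pred, redn = decorated_earley_sets({'S': [('S', 'S'), ('b',)]}, 'S', ['b', 'b', 'b'])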
We could walk the above structure in a fashion that is essentially the same as
described in Example 1 above. However, in order to construct a binarised SPPF we
also have to introduce additional nodes for grammar rules of length greater than
two. Thus the final algorithm is slightly more complicated.
An interior node, u, of the SPPF is either a symbol node labelled (B, j, i) or an
intermediate node labelled (B ::= γx · δ, j, i). A family of children of u will consist
of one or two nodes. For a symbol node the family will correspond to a grammar
rule B ::= γy or B ::= ε. If γ ≠ ε then the children will be labelled (B ::= γ · y, j, l)
and (y, l, i), for some l. Otherwise there will be a single child in the family, labelled
(y, j, i) or ε. For an intermediate node the family will have a child labelled (x, l, i). If
γ ≠ ε then the family will have a second child labelled (B ::= γ · xδ, j, l).
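The following small Python sketch (ours, for illustration only) prints the chain of intermediate nodes and families that this binarisation produces for a single derivation of one rule instance.

# Print the binarised family chain for a rule lhs ::= x1 ... xm whose symbols match
# the input between the boundaries j = bounds[0] <= ... <= bounds[m] = i.
def binarised_chain(lhs, rhs, bounds):
    j, i = bounds[0], bounds[-1]
    m = len(rhs)
    if m == 0:
        print(f"({lhs}, {i}, {i}) has the single child epsilon")
        return
    node = f"({lhs}, {j}, {i})"
    for k in range(m - 1, 0, -1):
        left = f"({lhs} ::= {' '.join(rhs[:k])} . {' '.join(rhs[k:])}, {j}, {bounds[k]})"
        right = f"({rhs[k]}, {bounds[k]}, {bounds[k + 1]})"
        print(f"{node} has the family ({left}, {right})")
        node = left
    print(f"{node} has the single child ({rhs[0]}, {j}, {bounds[1]})")

# One derivation of bbb for S ::= SS (cf. Example 2 below), split as b | bb:
binarised_chain('S', ['S', 'S'], [0, 1, 3])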
We now define a function which takes an SPPF node u and an item p from
an Earley set E_i, possibly decorated with pointers, and builds the corresponding
part of the SPPF. A decorated item consists of an LR(0) item, A ::= α · β, a left
hand index j, a right hand index, i, and a set of associated labelled pointers. We
assume that these attributes and the complete Earley set structure are passed into
Buildtree with u and p.
Buildtree(u, p) {
    suppose that p ∈ E_i and that p is of the form (A ::= α · β, j)
    mark p as processed
    if p = (A ::= ·, j) {
        if there is no SPPF node v labelled (A, i, i) create one with child node ε
        if u does not have a family (v) then add the family (v) to u }
    if p = (A ::= a · β, j) (where a is a terminal) {
        if there is no SPPF node v labelled (a, i − 1, i) create one
        if u does not have a family (v) then add the family (v) to u }
    if p = (A ::= C · β, j) (where C is a non-terminal) {
        if there is no SPPF node v labelled (C, j, i) create one
        if u does not have a family (v) then add the family (v) to u
        for each reduction pointer from p labelled j {
            suppose that the pointer points to q
            if q is not marked as processed Buildtree(v, q) } }
    if p = (A ::= α′a · β, j) (where a is a terminal, α′ ≠ ε) {
        if there is no SPPF node v labelled (a, i − 1, i) create one
        if there is no SPPF node w labelled (A ::= α′ · aβ, j, i − 1) create one
        for each target p′ of a predecessor pointer labelled i − 1 from p {
            if p′ is not marked as processed Buildtree(w, p′) }
        if u does not have a family (w, v) add the family (w, v) to u }
    if p = (A ::= α′C · β, j) (where C is a non-terminal, α′ ≠ ε) {
        for each reduction pointer from p {
            suppose that the pointer is labelled l and points to q
            if there is no SPPF node v labelled (C, l, i) create one
            if q is not marked as processed Buildtree(v, q)
            if there is no SPPF node w labelled (A ::= α′ · Cβ, j, l) create one
            for each target p′ of a predecessor pointer labelled l from p {
                if p′ is not marked as processed Buildtree(w, p′) }
            if u does not have a family (w, v) add the family (w, v) to u } }
}
We build the full SPPF from the root down using the following procedure.
PARSER {
    create an SPPF node u_0 labelled (S, 0, n)
    for each decorated item p = (S ::= α·, 0) ∈ E_n Buildtree(u_0, p)
}
We illustrate this approach using two examples: the first is the example, dis-
cussed above, that results in an error when Earley’s parsing approach is used; and
the second is a grammar with hidden left recursion and a cycle, resulting in infinitely
many derivations.
Example 2  Grammar : S ::= SS | b    Input : bbb
The Earley set structure is essentially
[Diagram: the Earley sets
E_0 = {(S ::= ·SS, 0), (S ::= ·b, 0)}
E_1 = {(S ::= b·, 0), (S ::= S · S, 0), (S ::= ·SS, 1), (S ::= ·b, 1)}
E_2 = {(S ::= b·, 1), (S ::= SS·, 0), (S ::= S · S, 1), (S ::= S · S, 0), (S ::= ·SS, 2), (S ::= ·b, 2)}
E_3 = {(S ::= b·, 2), (S ::= SS·, 1), (S ::= SS·, 0), (S ::= S · S, 2), (S ::= S · S, 1), (S ::= S · S, 0), (S ::= ·SS, 3), (S ::= ·b, 3)}
with labelled predecessor and reduction pointers between the items]
(for ease of reading, pointers from nodes not reachable from the node in E_3 labelled
(S ::= SS·, 0) have been left off the diagram). The corresponding (correct) binarised
SPPF, with the nodes labelled in construction order, is
[Diagram: the binarised SPPF for bbb, with nodes u_0, ..., u_11 comprising the symbol
nodes (S, 0, 3), (S, 0, 2), (S, 1, 3), (S, 0, 1), (S, 1, 2), (S, 2, 3), (b, 0, 1), (b, 1, 2),
(b, 2, 3), the intermediate nodes (S ::= S · S, 0, 2), (S ::= S · S, 0, 1), (S ::= S · S, 1, 2),
and packed nodes giving the two families of children of (S, 0, 3).]
Example 3
Grammar : S ::= AT | aT    A ::= a | BA    B ::= ε    T ::= bbb
Input : abbb
The Earley set structure is
[Diagram: the Earley sets
E_0 = {(S ::= ·AT, 0), (S ::= ·aT, 0), (A ::= ·a, 0), (A ::= ·BA, 0), (B ::= ·, 0), (A ::= B · A, 0)}
E_1 = {(S ::= a · T, 0), (A ::= a·, 0), (S ::= A · T, 0), (A ::= BA·, 0), (T ::= ·bbb, 1)}
E_2 = {(T ::= b · bb, 1)}
E_3 = {(T ::= bb · b, 1)}
E_4 = {(T ::= bbb·, 1), (S ::= aT ·, 0), (S ::= AT ·, 0)}
with labelled predecessor and reduction pointers between the items]
and the corresponding binarised SPPF is
[Diagram: the binarised SPPF for abbb, with nodes u_0, ..., u_12 comprising the symbol
nodes (S, 0, 4), (T, 1, 4), (A, 0, 1), (B, 0, 0), (a, 0, 1), (b, 1, 2), (b, 2, 3), (b, 3, 4),
the intermediate nodes (S ::= a · T, 0, 1), (S ::= A · T, 0, 1), (A ::= B · A, 0, 0),
(T ::= b · bb, 1, 2), (T ::= bb · b, 1, 3), an ε node below (B, 0, 0), and packed nodes
for the two families of children of (S, 0, 4) and of (A, 0, 1).]
5 An integrated parsing algorithm
The Buildtree function described in the previous section is not as efficient as it could
be; it has been designed to reflect the principles underlying the approach. We now
give a different version of an Earley parser that constructs a binarised SPPF as the
Earley sets are constructed, and does not require the items to be decorated with
pointers.
The SPPF constructed is similar to the binarised SPPF constructed by the
BRNGLR algorithm but the additional nodes are the left hand rather than right
hand children, reflecting the fact that Earley’s recogniser is essentially top down
rather than bottom up. It is also slightly smaller than the corresponding SPPF
from the previous section as a node with a label of the form (A ::= x · β, j, i) is
merged with its child.
The algorithm itself is in a form that is similar to the form in which GLR
algorithms are traditionally presented. There is a step in the algorithm for each
element of the input string and at step i the Earley set E_i is constructed, along
with all the SPPF nodes with labels of the form (s, j, i), j ≤ i.
In order to construct the SPPF as the Earley sets are built, we record with each
Earley item the SPPF node that corresponds to it. Thus Earley items are triples
(s, j, w) where s is a non-terminal or an LR(0) item, j is an integer and w is an
SPPF node with a label of the form (s, j, l). The subtree below such a node w will
correspond to the derivation of the substring a_{j+1} ... a_l, from B if s is B and from
α if s is B ::= α · β. Earley items of the form (A ::= α · β, j) where |α| ≤ 1 do not
have associated SPPF nodes, so we use the dummy node null in this case.
The items in each E_i have to be 'processed' either to add more elements to E_i or
to form the basis of the next set E_{i+1}. Thus when an item is added to E_i it is also
added to a set Q, if it is of the form (A ::= α · a_{i+1}β, h, w), or to a set R otherwise.
Elements are removed from R as they are processed and when R is empty the items
in Q are processed to initialise E_{i+1}.
There is a special case when an item of the form (A ::= α·, i, w) is in E_i;
this happens if A ⇒ α ⇒* ε. When this item is processed, items of the form (X ::=
τ · Aδ, i, v) ∈ E_i have to be considered and it is possible that an item of this form
may be created after the item (A ::= α·, i, w) has been processed. Thus we use a
set H and, when (A ::= α·, i, w) is processed, the pair (A, w) is added to H. Then
when (X ::= τ · Aδ, i, v) is processed the elements of H are checked and appropriate
action is taken.
When an SPPF node is needed we first check to see if one with the required
label already exists. To facilitate this checking the SPPF nodes constructed at the
current step are added to a set V.
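The role of these sets can be pictured with the following skeleton of a single step (an illustrative sketch in Python, not the paper's code); process_R_item and process_Q_item stand for the two case analyses given in the algorithm below.

# Illustrative skeleton of the bookkeeping for one step of the integrated parser.
# R and Q are the two worklists described above, Q_prev plays the role of Q' carried
# over from the previous step, H records pairs (A, w) for nullable completions made
# at this position, and V caches the SPPF nodes created during the current step.
def parser_step(E_i, Q_prev, process_R_item, process_Q_item):
    R = list(E_i)              # items of E_i waiting to be processed
    Q = list(Q_prev)           # items whose next symbol is the terminal a_{i+1}
    Q_next = []                # will become Q at the next step (the set Q' above)
    H = {}                     # A -> SPPF node w, for A =>* epsilon completed here
    V = {}                     # (s, j, i) -> SPPF node created during this step

    while R:
        # may add to R, Q and H, and create SPPF nodes recorded in V
        process_R_item(R.pop(), R, Q, H, V)

    V.clear()                  # nodes built while scanning a_{i+1} get a fresh cache
    while Q:
        # moves the dot over a_{i+1}, filling E_{i+1} and Q_next
        process_Q_item(Q.pop(), Q_next, V)
    return Q_next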
In the following algorithm Σ_N denotes the set of all strings of terminals and
non-terminals that begin with a non-terminal, together with the empty string, ε.
Input: a grammar Γ = (N, T, S, P) and a string a_1a_2 ... a_n
EARLEY PARSER {
    E_0, ..., E_n, R, Q′, V = ∅
    for all (S ::= α) ∈ P {
        if α ∈ Σ_N add (S ::= ·α, 0, null) to E_0
        if α = a_1α′ add (S ::= ·α, 0, null) to Q′ }
    for 0 ≤ i ≤ n {
        H = ∅, R = E_i, Q = Q′
        Q′ = ∅
        while R ≠ ∅ {
            remove an element, Λ say, from R
            if Λ = (B ::= α · Cβ, h, w) {
                for all (C ::= δ) ∈ P {
                    if δ ∈ Σ_N and (C ::= ·δ, i, null) ∉ E_i {
                        add (C ::= ·δ, i, null) to E_i and R }
                    if δ = a_{i+1}δ′ { add (C ::= ·δ, i, null) to Q } }
                if ((C, v) ∈ H) {
                    let y = MAKE_NODE(B ::= αC · β, h, i, w, v, V)
                    if β ∈ Σ_N and (B ::= αC · β, h, y) ∉ E_i {
                        add (B ::= αC · β, h, y) to E_i and R }
                    if β = a_{i+1}β′ { add (B ::= αC · β, h, y) to Q } } }
            if Λ = (D ::= α·, h, w) {
                if w = null {
                    if there is no node v ∈ V labelled (D, i, i) create one
                    set w = v
                    if w does not have family (ε) add one }
                if h = i { add (D, w) to H }
                for all (A ::= τ · Dδ, k, z) in E_h {
                    let y = MAKE_NODE(A ::= τD · δ, k, i, z, w, V)
                    if δ ∈ Σ_N and (A ::= τD · δ, k, y) ∉ E_i {
                        add (A ::= τD · δ, k, y) to E_i and R }
                    if δ = a_{i+1}δ′ { add (A ::= τD · δ, k, y) to Q } } }
        }
        V = ∅
        create an SPPF node v labelled (a_{i+1}, i, i + 1)
        while Q ≠ ∅ {
            remove an element, Λ = (B ::= α · a_{i+1}β, h, w) say, from Q
            let y = MAKE_NODE(B ::= αa_{i+1} · β, h, i + 1, w, v, V)
            if β ∈ Σ_N { add (B ::= αa_{i+1} · β, h, y) to E_{i+1} }
            if β = a_{i+2}β′ { add (B ::= αa_{i+1} · β, h, y) to Q′ }
        }
    }
    if (S ::= τ·, 0, w) ∈ E_n return w
    else return failure
}
MAKE_NODE(B ::= αx · β, j, i, w, v, V) {
    if β = ε { let s = B } else { let s = (B ::= αx · β) }
    if α = ε and β ≠ ε { let y = v }
    else {
        if there is no node y ∈ V labelled (s, j, i) create one and add it to V
        if w = null and y does not have a family of children (v) add one
        if w ≠ null and y does not have a family of children (w, v) add one }
    return y
}
Using this algorithm on Example 3 from Section 4 results in the following SPPF.
[Diagram: the binarised SPPF produced by the integrated algorithm for Example 3,
with nodes u_1, ..., u_9 comprising the symbol nodes (S, 0, 4), (T, 1, 4), (A, 0, 1),
(B, 0, 0), (a, 0, 1), (b, 1, 2), (b, 2, 3), (b, 3, 4), the single intermediate node
(T ::= bb · b, 1, 3), an ε node below (B, 0, 0), and packed nodes for the families of
children of (S, 0, 4) and (A, 0, 1).]
6 The order of the parsers
(As we have done throughout the paper, in this section we use n to denote the
length of the input to the parser.)
A formal proof that the binarised SPPFs constructed by the BRNGLR algorithm
contain at most O(n^3) nodes and at most O(n^3) edges is given in [17]. The proof
that the binarised SPPFs constructed by the parsers described in this paper are of at
most cubic size is the same, and we do not give it here. Intuitively, however, the non-
packed nodes are characterised by an LR(0) item and two integers, 0 ≤ j ≤ i ≤ n,
and thus there are at most O(n^2) of them. Packed nodes are children of some non-
packed node, labelled (s, j, i) say, and for a given non-packed node the packed node
children are characterised by an LR(0) item and an integer l which lies between
j and i. Thus each non-packed node has at most O(n) packed node children and
there are at most O(n^3) packed nodes in a binarised SPPF. As non-packed nodes
are the source of at most O(n) edges and packed nodes are the source of at most
two edges, there are also at most O(n^3) edges in a binarised SPPF.
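Written out, the counting argument gives bounds of the following shape, where G denotes the set of LR(0) items of the grammar (a constant once the grammar is fixed); this is a sketch of the bound, not the formal proof of [17].

\[
\#\{\text{non-packed nodes}\} \;\le\; (|N \cup T| + |G|)\,(n+1)^2 \;=\; O(n^2),
\qquad
\#\{\text{packed nodes}\} \;\le\; (|N \cup T| + |G|)\,(n+1)^2 \times |G|\,(n+1) \;=\; O(n^3),
\]
\[
\#\{\text{edges}\} \;\le\; O(n^2)\cdot O(n) \;+\; 2\cdot O(n^3) \;=\; O(n^3).
\]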
For the parsing approach based on the Buildtree procedure described in Sec-
tion 4, the Earley sets are constructed as for Earley's original algorithm. There
are at most O(n^2) items and each item has at most O(n) predecessor pointers,
one to each of the collections E_j, 0 ≤ j ≤ i. Because an item is marked as pro-
cessed as soon as Buildtree is called on it, the parsing process makes at most O(n^2)
calls to Buildtree. Assuming that the SPPF is represented in a way that allows
n-independent look-up time for a particular node and family of children, the only
n-dependent behaviour of Buildtree occurs during the iteration over the predecessor
pointers from the input item, and there are at most O(n) such pointers. It is pos-
sible to represent the SPPF in the required fashion, one such representation being
described in [17]. Thus our Earley parsers can be implemented so that they have
worst-case cubic order.
Finally we consider the integrated Earley parser given in Section 5. The while-
loop that processes the elements in R executes once for each element added to E_i.
For each triple (s, j, i) there is at most one SPPF node labelled with this triple, and
thus there are at most O(n) items in E_i. So the while-loop executes at most O(n)
times. As we have already remarked, it is possible to implement the SPPF to allow
n-independent look-up time for a given node and family of children. Thus, within
the while-loop for R, the only case that triggers potentially n-dependent behaviour
is the case when the item chosen is of the form (D ::= α·, h, w). In this case the
set E_h must be searched. This is a worst-case O(n) operation. The while-loop that
processes Q is not n-dependent, thus the integrated parser is worst case O(n^3).
7 Summary and further work
In this paper we have given two versions of a parser based on Earley’s recognition
algorithm, both of which are of worst case cubic order.
Both algorithms construct a binarised SPPF that represents all possible deriva-
tions of the given input string. The approach is based on the approach taken in
BRNGLR, a cubic version of Tomita’s algorithm, and the SPPFs constructed are
equivalent to those constructed by BRNGLR. Some experimental results compar-
ing the recogniser versions of BRNGLR and Earley’s algorithm are reported in [13].
Now further experimental work is required to compare the performance of the inte-
grated Earley parser described in this paper with the parser version of BRNGLR.
References
[1] John Aycock and Nigel Horspool. Faster generalised LR parsing. In Compiler Construction, 8th Intnl.
Conf, CC’99, volume 1575 of Lecture Notes in Computer Science, pages 32–46. Springer-Verlag, 1999.
[2] John Aycock, R. Nigel Horspool, Jan Janoušek, and Bořivoj Melichar. Even faster generalised LR
parsing. Acta Informatica, 37(8):633–651, 2001.
[3] Claus Brabrand. Grambiguity. http://www.brics.dk/~brabrand/grambiguity/, 2006.
[4] Frank L. DeRemer and Thomas J. Pennello. Efficient computation of LALR(1) look-ahead sets. ACM
Trans. Program. Lang. Syst., 4(4):615–649, October 1982.
[5] Franklin L. DeRemer. Practical translators for LR(k) languages. PhD thesis, Massachusetts Institute
of Technology, 1969.
[6] J Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102,
February 1970.
[7] James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. Addison-Wesley, 1996.
[8] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Language Specification Third Edition.
Addison-Wesley, 2005.
[9] Susan L. Graham and Michael A. Harrison. Parsing of general context-free languages. Advances in
Computers, 14:77–185, 1976.
[10] Dick Grune and Ceriel Jacobs. Parsing Techniques: A Practical Guide. Ellis Horwood, Chichester,
England. (See also: http://www.cs.vu.nl/~dick/PTAPG.html), 1990.
[11] John E Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and
Computation. Series in Computer Science. Addison-Wesley, 1979.
[12] Mark Johnson. The computational complexity of GLR parsing. In Masaru Tomita, editor, Generalized
LR parsing, pages 35–42. Kluwer Academic Publishers, The Netherlands, 1991.
[13] Adrian Johnstone, Elizabeth Scott, and Giorgios Economopoulos. Generalised parsing: some costs. In
Evelyn Duesterwald, editor, Compiler Construction, 13th Intnl. Conf, CC’04, volume 2985 of Lecture
Notes in Computer Science, pages 89–103. Springer-Verlag, Berlin, 2004.
[14] Donald E Knuth. On the translation of languages from left to right. Information and Control, 8(6):607–
639, 1965.
[15] Rahman Nozohoor-Farshi. GLR parsing for ε-grammars. In Masaru Tomita, editor, Generalized LR
Parsing, pages 60–75. Kluwer Academic Publishers, The Netherlands, 1991.
[16] Jan G. Rekers. Parser generation for interactive environments. PhD thesis, University of Amsterdam,
1992.
[17] E.A. Scott, A.I.C. Johnstone, and G.R. Economopoulos. BRN-table based GLR parsers. Technical
Report TR-03-06, Computer Science Department, Royal Holloway, University of London, London, 2003.
[18] Elizabeth Scott and Adrian Johnstone. Generalised bottom up parsers with reduced stack activity.
The Computer Journal, 48(5):565–587, 2005.
[19] Masaru Tomita. Efficient parsing for natural language. Kluwer Academic Publishers, Boston, 1986.
[20] D H Younger. Recognition of context-free languages in time n^3. Inform. Control, 10(2):189–208,
February 1967.