A Graphical Model for Context-free Grammar Parsing
Keshav Pingali¹ and Gianfranco Bilardi²
¹ The University of Texas, Austin, Texas 78712, USA.
Email: pingali@cs.utexas.edu
² Università di Padova, 35131 Padova, Italy.
Email: bilardi@dei.unipd.it
Abstract. In the compiler literature, parsing algorithms for context-free grammars are presented using string rewriting systems or abstract machines such as pushdown automata. Unfortunately, the resulting descriptions can be baroque, and even a basic understanding of some parsing algorithms, such as Earley's algorithm for general context-free grammars, can be elusive. In this paper, we present a graphical representation of context-free grammars called the Grammar Flow Graph (GFG) that permits parsing problems to be phrased as path problems in graphs; intuitively, the GFG plays the same role for context-free grammars that nondeterministic finite-state automata play for regular grammars. We show that the GFG permits an elementary treatment of Earley's algorithm that is much easier to understand than previous descriptions of this algorithm. In addition, look-ahead computation can be expressed as a simple inter-procedural dataflow analysis problem, providing an unexpected link between front-end and back-end technologies in compilers. These results suggest that the GFG can be a new foundation for the study of context-free grammars.
1 Introduction
The development of elegant and practical parsing algorithms for context-free grammars is one of the major accomplishments of 20th century Computer Science. Two abstractions are used to present these algorithms: string rewriting systems and pushdown automata. However, the resulting descriptions are unsatisfactory for several reasons:
– Even an elementary understanding of some grammar classes requires mastering a formidable number of complex concepts. For example, LR(k) parsing requires an understanding of rightmost derivations, right sentential forms, viable prefixes, handles, complete valid items, and conflicts, among other notions.
– Parsing algorithms for different grammar classes are presented using different abstractions; for example, LL grammars are presented using recursive-descent, while LR grammars are presented using shift-reduce parsers. This obscures connections between different grammar classes and parsing techniques.
– Although regular grammars are a proper subset of context-free grammars, parsing algorithms for regular grammars, which are presented using finite-state automata, appear to be entirely unrelated to parsing algorithms for context-free grammars.
In this paper, we present a novel approach to context-free grammar parsing that is based on a graphical representation of context-free grammars called the Grammar Flow Graph (GFG). Intuitively, the GFG plays the same role for context-free grammars that the nondeterministic finite-state automaton (NFA) does for regular grammars: parsing problems can be formulated as path problems in the GFG, and parsing algorithms become algorithms for solving these path problems. The GFG simplifies and unifies the presentation of parsing algorithms for different grammar classes; in addition, finite-state automata can be seen as an optimization of the GFG for the special case of regular grammars, providing a pleasing connection between regular and context-free grammars.
Section 2 introduces the GFG, and shows how the GFG for a given context-free grammar can be constructed in a straightforward way. Membership of a string in the language generated by the grammar can be proved by finding what we call a complete balanced GFG path that generates this string. Since every regular grammar is also a context-free grammar, a regular grammar has both a GFG and an NFA representation. In Section 2.4, we establish a connection between these representations: we show that applying the continuation-passing style (CPS) optimization [13,18] to the GFG of a right-linear regular grammar produces an NFA that is similar to the NFA produced by the standard algorithm for converting a right-linear regular grammar to an NFA.
Earley's algorithm [6] for parsing general context-free grammars is one of the more complicated parsing algorithms in the literature [1]. The GFG reveals that this algorithm is a straightforward extension of the well-known "ε-closure" algorithm for simulating all the moves of an NFA (Section 3). The resulting description is much simpler than previous descriptions of this algorithm, which are based on dynamic programming, abstract interpretation, and Galois connections [6,8,5].
Look-ahead is usually presented in the context of particular parsing strategies such as SLL(1) parsing. In Section 4, we show that the GFG permits look-ahead computation to be formulated independently of the parsing strategy as a simple inter-procedural dataflow analysis problem, unifying algorithmic techniques for compiler front-ends and back-ends. The GFG also enables a simple description of parsers for LL and LR grammars and their sub-classes such as SLL, SLR and LALR grammars, although we do not discuss this in this paper.
Section 5 describes related work. Structurally, the GFG resembles the recursive transition network (RTN) [20], which is used in natural language processing and parsers like ANTLR [10], but there are crucial differences. In particular, the GFG is a single graph in which certain paths are of interest, not a collection of recursive state machines with an operational model like chart parsing for their interpretation. Although motivated by similar concerns, complete balanced paths are different from CFL-paths [21].
Proofs of the main theorems are given in the appendix.
2 Grammar Flow Graph (GFG) and complete balanced paths
A context-free grammar Γ is a 4-tuple <N, T, P, S> where N is a finite set of non-terminals, T is a finite set of terminals, P ⊆ N × (N ∪ T)* is the set of productions, and S ∈ N is the start symbol. To simplify the development, we make the following standard assumptions about Γ throughout this paper.
A1: S does not appear on the righthand side of any production.
A2: Every non-terminal is used in a derivation of some string of terminals from S (no useless non-terminals [1]).
Any grammar Γ0 can be transformed in time O(|Γ0|) into an equivalent grammar Γ satisfying the above assumptions [17]. The running example in this paper is this grammar: E → int | (E+E) | E+E. An equivalent grammar is shown in Figure 1, where the production S → E has been added to comply with A1.
2.1 Grammar flow graph (GFG)
Figure 1 shows the GFG for the expression grammar. Some edges are labeled explicitly with terminal symbols, and the others are implicitly labeled with ε. The GFG can be understood by analogy with inter-procedural control-flow graphs: each production is represented by a "procedure" whose control-flow graph represents the righthand side of that production, and a non-terminal A is represented by a pair of nodes •A and A•, called the start and end nodes for A, that gather together the control-flow graphs for the productions of that non-terminal. An occurrence of a non-terminal in the righthand side of a production is treated as an invocation of that non-terminal.
The control-flow graph for a production A → u1u2..ur has r+1 nodes. As in finite-state automata, node labels in a GFG do not play a role in parsing and can be chosen arbitrarily, but it is convenient to label these nodes A→•u1u2..ur through A→u1u2..ur•; intuitively, the • indicates how far parsing has progressed through a production (these labels are related to items [1]). The first and last nodes in this sequence are called the entry and exit nodes for that production. If ui is a terminal, there is a scan edge with that label from the scan node A→u1..ui−1•ui..ur to node A→u1..ui•ui+1..ur, just as in finite-state automata. If ui is a non-terminal, it is considered to be an "invocation" of that non-terminal, so there are call and return edges that connect node A→u1..ui−1•ui..ur to the start node of non-terminal ui, and its end node to A→u1..ui•ui+1..ur.
Formally, the GFG for a grammar Γ is denoted by GFG(Γ), and it is defined as shown in Definition 1. It is easy to construct the GFG for a grammar Γ in O(|Γ|) time and space using Definition 1.
Fig. 1. Grammar Flow Graph example
Definition 1. If Γ = <N, T, P, S> is a context-free grammar, G = GFG(Γ) is the smallest directed graph (V(Γ), E(Γ)) that satisfies the following properties.
– For each non-terminal A ∈ N, V(Γ) contains nodes labeled •A and A•, called the start and end nodes respectively for A.
– For each production A → ε, V(Γ) contains a node labeled A→•, and E(Γ) contains edges (•A, A→•) and (A→•, A•).
– For each production A → u1u2...ur,
  • V(Γ) contains (r+1) nodes labeled A→•u1...ur, A→u1•u2...ur, ..., A→u1...ur•,
  • E(Γ) contains the entry edge (•A, A→•u1...ur) and the exit edge (A→u1...ur•, A•),
  • for each ui ∈ T, E(Γ) contains a scan edge (A→u1...ui−1•ui..ur, A→u1...ui•ui+1..ur) labeled ui,
  • for each ui ∈ N, E(Γ) contains a call edge (A→u1...ui−1•ui...ur, •ui) and a return edge (ui•, A→u1...ui•ui+1...ur). Node A→u1...ui−1•ui...ur is a call node, and matches the return node A→u1...ui•ui+1...ur.
Edges other than scan edges are labeled with ε.
When the grammar is obvious from the context, a GFG will be denoted by G=(V, E).
Note that start and end nodes are the only nodes that can have a fan-out greater than
one. This fact will be important when we interpret the GFG as a nondeterministic au-
tomaton in Section 2.3.
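Definition 1 translates directly into code. The following Python sketch uses a hypothetical encoding of our own, not the paper's: ('start', A) and ('end', A) stand for •A and A•, ('item', A, rhs, i) stands for the dotted node A→rhs[:i]•rhs[i:], and an edge label of None stands for ε.

# A sketch of GFG construction per Definition 1 (hypothetical encoding).
def build_gfg(nonterminals, productions):
    """productions: iterable of (A, rhs) pairs, rhs a tuple of symbols."""
    nodes, edges = set(), []
    for a in nonterminals:
        nodes.update({('start', a), ('end', a)})
    for a, rhs in productions:
        items = [('item', a, rhs, i) for i in range(len(rhs) + 1)]
        nodes.update(items)
        edges.append((('start', a), None, items[0]))       # entry edge
        edges.append((items[-1], None, ('end', a)))        # exit edge
        for i, u in enumerate(rhs):
            if u in nonterminals:                          # call and return edges
                edges.append((items[i], None, ('start', u)))
                edges.append((('end', u), None, items[i + 1]))
            else:                                          # scan edge labeled u
                edges.append((items[i], u, items[i + 1]))
    return nodes, edges

# The expression grammar of Figure 1: S -> E, E -> int | (E+E) | E+E.
nodes, edges = build_gfg({'S', 'E'},
                         [('S', ('E',)), ('E', ('int',)),
                          ('E', ('(', 'E', '+', 'E', ')')),
                          ('E', ('E', '+', 'E'))])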
Node type   Description
start       node labeled •A
end         node labeled A•
call        node labeled A→α•Bγ
return      node labeled A→αB•γ
entry       node labeled A→•α
exit        node labeled A→α•
scan        node labeled A→α•tγ
Table 1. Classification of GFG nodes: a node can belong to several categories. (A, B ∈ N, t ∈ T, and α, γ ∈ (T ∪ N)*)
2.2 Balanced paths
The following definition is standard.
Definition 2. A path in a GFG G = (V, E) is a non-empty sequence of nodes v0, ..., vl such that (v0, v1), (v1, v2), ..., (vl−1, vl) are all edges in E.
In a given GFG, the notation v1 ↣ v2 denotes the edge from v1 to v2, and the notation v1 ↠ vn denotes a path from v1 to vn; the symbols "→" and "⇒" are reserved for productions and derivations. If Q1: v1 ↠ vm and Q2: vm ↠ vr are paths in a GFG, the notation Q1 + Q2 denotes the concatenation of paths Q1 and Q2. In this spirit, we denote string concatenation by + as well. It is convenient to define the following terms to talk about certain paths of interest in the GFG.
Definition 3. A complete path in a GFG is a path whose first node is •S and whose last node is S•.
A path is said to generate the word w resulting from the concatenation of the labels on its sequence of edges. By convention, w = ε for a path with a single node.
The GFG can be viewed as a nondeterministic finite-state automaton (NFA) whose start state is •S, whose accepting state is S•, and which makes nondeterministic choices at start and end nodes that have a fan-out of more than one. Each complete GFG path generates a word in the regular language recognized by this NFA. In Figure 1, the path
Q: •S, S→•E, •E, E→•(E+E), E→(•E+E), •E, E→•int, E→int•, E•, S→E•, S•
generates the string "( int". However, this string is not generated by the context-free grammar from which this GFG was constructed.
To interpret the GFG as the representation of a context-free grammar, it is necessary to restrict the paths that can be followed by the automaton. Going back to the intuition that the GFG is similar to an inter-procedural call graph, we see that Q is not an inter-procedurally valid path [14]: at E•, it is necessary to take the return edge to node E→(E•+E), since the call of E that is being completed was made at node E→(•E+E). In general, the automaton can make a free choice at start nodes just like an NFA, but at end nodes, the return edge to be taken is determined by the call that is being completed.
The paths the automaton is allowed to follow are called complete balanced paths in this paper. Intuitively, if we consider matching call and return nodes to be opening and closing parentheses respectively of a unique color, the parentheses on a complete balanced path must be properly nested [4]. In the formal definition below, if K is a sequence of nodes, we let v, K, w represent the sequence of nodes obtained by prepending node v and appending node w to K.
Definition 4. Given a GFG for a grammar Γ = <N, T, P, S>, the set of balanced sequences of call and return nodes is the smallest set of sequences of call and return nodes that is closed under the following conditions.
– The empty sequence is balanced.
– The sequence (A→α•Bγ), K, (A→αB•γ) is balanced if K is a balanced sequence and production (A→αBγ) ∈ P.
– The concatenation of two balanced sequences v1...vf and y1...ys is balanced if vf ≠ y1. If vf = y1, the sequence v1...vf y2...ys is balanced.
This definition is essentially the same as the standard definition of balanced sequences of parentheses; the only difference is the case of vf = y1 in the last clause, which arises because a node of the form A→αX•Yβ is both a return node and a call node.
Definition 5. A GFG path v0 ↠ vl is said to be a balanced path if its subsequence of call and return nodes is balanced.
Theorem 1. If Γ = <N, T, P, S> is a context-free grammar and w ∈ T*, w is in the language generated by Γ iff it is generated by a complete balanced path in GFG(Γ).
Proof. This is a special case of Theorem 4 in the Appendix.
Therefore, the parsing problem for a context-free grammar Γ can be framed in GFG terms as follows: given a string w, determine if there are complete balanced paths in GFG(Γ) that generate w (recognition), and if so, produce a representation of these paths (parsing). If the grammar is unambiguous, each string in the language is generated by exactly one such path.
The parsing techniques considered in this paper read the input string w from left to right one symbol at a time, and determine reachability along certain paths starting at •S. These paths are always prefixes of complete balanced paths, and if a prefix u of w has been read up to that point, all these paths generate u. For the technical development, similar paths are needed even though they begin at nodes other than •S. Intuitively, these call-return paths (CR-paths for short) are just segments of complete balanced paths; they may contain unmatched call and return nodes, but they do not have mismatched call and return nodes, so they can always be extended to complete balanced paths.
Definition 6. Given a GFG, a CR-sequence is a sequence of call and return nodes that does not contain a subsequence vc, K, vr where vc is a call node, K is balanced, vr is a return node, and vc and vr are not matched.
Definition 7. A GFG path is said to be a CR-path if its subsequence of call and return nodes is a CR-sequence.
Unless otherwise specified, the origin of a CR-path will be assumed to be •S, the case that arises most frequently.
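The stack discipline behind Definitions 4–7 can be made concrete with a few lines of Python. This sketch takes the subsequence of call and return nodes of a path; matches is an assumed helper that tests whether a call node matches a return node, and a node that is both a return node and a call node (the vf = y1 case of Definition 4) is assumed to appear twice in the input, first as a return and then as a call.

# A sketch of the balanced / CR / mismatched trichotomy of Section 2.2.
def classify(call_return_sequence, matches):
    stack, unmatched_return = [], False
    for node, kind in call_return_sequence:    # kind is 'call' or 'return'
        if kind == 'call':
            stack.append(node)
        elif not stack:
            unmatched_return = True            # unmatched returns are allowed on CR-paths
        elif matches(stack[-1], node):
            stack.pop()                        # properly nested call/return pair
        else:
            return 'mismatched'                # neither balanced nor a CR-sequence
    return 'CR' if stack or unmatched_return else 'balanced'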
2.3 Nondeterministic GFG automaton (NGA)
NGA configuration: (PC × C × K), where:
– Program counter PC ∈ V(Γ) (a state of the finite control)
– Partially-read input string C ∈ T* × T* (C = (u, v), written u•v, where prefix u of input string w = uv has been read)
– Stack of return nodes K ∈ VR(Γ)*, where VR(Γ) is the set of return nodes
Initial configuration: <•S, •w, [ ]>
Accepting configuration: <S•, w•, [ ]>
Transition function:
CALL:  <A→α•Bγ, C, K> ↦ <•B, C, (A→αB•γ)·K>
START: <•B, C, K> ↦ <B→•β, C, K>  (nondeterministic choice)
EXIT:  <B→β•, C, K> ↦ <B•, C, K>
END:   <B•, C, (A→αB•γ)·K> ↦ <A→αB•γ, C, K>
SCAN:  <A→α•tγ, u•tv, K> ↦ <A→αt•γ, ut•v, K>
Fig. 2. Nondeterministic GFG Automaton (NGA)
Figure 2 specifies a pushdown automaton (PDA), called the nondeterministic GFG automaton (NGA), that traverses complete balanced paths in a GFG under the control of the input string. To match calls with returns, it uses a stack of "return addresses", as is done in implementations of procedural languages. The configuration of the automaton is a three-tuple consisting of the GFG node where the automaton currently is (this is called the PC), the partially read input string, and a stack of return nodes. The symbol ↦ denotes a state transition.
The NGA begins at •S with the empty stack. At a call node, it pushes the matching return node on the stack. At a start node, it chooses the production nondeterministically. At an end node, it pops a return node from the stack and continues the traversal from there. If the input string is in the language generated by the grammar, the automaton will reach S• with the empty stack (the END rule cannot fire at S• because the stack is empty). We will call this a nondeterministic GFG automaton, or NGA for short. It is a special kind of pushdown automaton (PDA). It is not difficult to prove that the NGA accepts exactly those strings that can be generated by some complete balanced path in GFG(Γ); whence, by Theorem 1, the NGA accepts the language of Γ. (Technically, acceptance is by final state [9], but it is easily shown that the final state S• can only be reached with an empty stack.)
The nondeterminism in the NGA is called globally angelic nondeterminism [3] because the nondeterministic transitions at start nodes have to ensure that the NGA ultimately reaches S• if the string is in the language generated by the grammar. The recognition algorithms described in this paper are concerned with deterministic implementations of the globally angelic nondeterminism in the NGA.
2.4 Relationship between NFA and GFG for regular grammars
Every regular grammar is also a context-free grammar, so a regular grammar has two graphical representations, an NFA and a GFG. A natural question is whether there is a connection between these graphs. We show that applying the continuation-passing style (CPS) optimization [13,18] to the NGA of a context-free grammar that is a right-linear regular grammar³ produces an NFA for that grammar.
For any context-free grammar, consider a production A → αB in which the last symbol on the righthand side is a non-terminal. The canonical NGA in Figure 2 will push the return node A→αB• before invoking B, but after returning to this exit node, the NGA just transitions to A• and pops the return node for the invocation of A. Had a return address not been pushed when the call to B was made, the NGA would still recognize the input string correctly, because when the invocation of B completes, the NGA would pop the return stack and transition directly to the return node for the invocation of A. This optimization is similar to the continuation-passing style (CPS) transformation, which is used in programming language implementations to convert tail-recursion to iteration.
To implement the CPS optimization in the context of the GFG, it is useful to introduce a new type of node called a no-op node, which represents a node at which the NGA does nothing other than transition to the successor of that node. If a production for a non-terminal other than S ends with a non-terminal, the corresponding call node is replaced with a no-op node; since the NGA will never come back to the corresponding return node, this node can be replaced with a no-op node as well. For a right-linear regular grammar, there are no call or return nodes in the optimized GFG. The resulting GFG is just an NFA, and it is a variation of the NFA that is produced by using the standard algorithms for producing an NFA from a right-linear regular grammar [9].
3 Parsing of general context-free grammars
General context-free grammars can be parsed using an algorithm due to Earley [6]. Described using derivations, the algorithm is not very intuitive and seems unrelated to other parsing algorithms. For example, the monograph on parsing by Sippu and Soisalon-Soininen [17] omits it, Grune and Jacobs' book describes it as "top-down restricted breadth-first bottom-up parsing" [8], and the "Dragon book" [1] mentions it only in the bibliography as "a complex, general-purpose algorithm due to Earley that tabulates LR-items for each substring of the input." Cousot and Cousot use Galois connections between lattices to show that Earley's algorithm is an abstract interpretation of a refinement of the derivation semantics of context-free grammars [5].
In contrast to these complicated narratives, a remarkably simple interpretation of Earley's algorithm emerges when it is viewed in terms of the GFG: Earley's algorithm is the context-free grammar analog of the well-known simulation algorithm for nondeterministic finite-state automata (NFA) [1]. While the latter tracks reachability along prefixes of complete paths, the former tracks reachability along prefixes of complete balanced paths.
³ A right-linear regular grammar is a regular grammar in which the righthand side of a production consists of a string of zero or more terminals followed by at most one non-terminal.
(a) GFG for the expression grammar, with nodes given short labels (•S, p, q, S•; •E, a–l, E•). (b) NFA reachability: the sets Σ0, ..., Σ5 computed by the NFA simulation algorithm for the input string "7+8+9". (c) Earley parser: the tagged sets Σ0, ..., Σ5 computed by Earley's algorithm for the same input.
Fig. 3. Earley parser: example.
3.1 NFA simulation algorithm
As a step towards Earley's algorithm, consider interpreting the GFG as an NFA (so nondeterministic choices are permitted at both start and end nodes). The NFA simulation on a given input word w[1..n] can be viewed as the construction of a sequence of node sets Σ0, ..., Σn. Here, Σ0 is the ε-closure of {•S}. For i = 1, ..., n, set Σi is the ε-closure of the set of nodes reachable from nodes in Σi−1 by scan edges labeled w[i]. The string w is in the language recognized by the NFA if and only if S• ∈ Σn.
(a) NFA simulation of GFG
Sets of GFG nodes Σ ∈ P(V(Γ))
Partially-read input strings C ∈ T* × T*
Recognizer configurations: (Σ × C)+
Acceptance: S• ∈ Σ|w|
INIT:  (•S ∈ Σ0), (C0 = •w)
CALL:  A→α•Bγ ∈ Σj  implies  •B ∈ Σj
START: •B ∈ Σj  implies  B→•β ∈ Σj
EXIT:  B→β• ∈ Σj  implies  B• ∈ Σj
END:   B• ∈ Σj  implies  A→αB•γ ∈ Σj
SCAN:  A→α•tγ ∈ Σj, Cj = u•tv  implies  (A→αt•γ ∈ Σj+1), (Cj+1 = ut•v)

(b) Earley recognizer
Non-negative integers: N
Sets of tagged GFG nodes Σ ∈ P(V(Γ) × N)
Partially-read input strings C ∈ T* × T*
Recognizer configurations: (Σ × C)+
Acceptance: <S•, 0> ∈ Σ|w|
INIT:  (<•S, 0> ∈ Σ0), (C0 = •w)
CALL:  <A→α•Bγ, i> ∈ Σj  implies  <•B, j> ∈ Σj
START: <•B, j> ∈ Σj  implies  <B→•β, j> ∈ Σj
EXIT:  <B→β•, k> ∈ Σj  implies  <B•, k> ∈ Σj
END:   <B•, k> ∈ Σj, <A→α•Bγ, i> ∈ Σk  implies  <A→αB•γ, i> ∈ Σj
SCAN:  <A→α•tγ, i> ∈ Σj, Cj = u•tv  implies  (<A→αt•γ, i> ∈ Σj+1), (Cj+1 = ut•v)

(c) Earley parser
Non-negative integers: N
Program counter PC ∈ V(Γ) × N
Stack of call nodes K ∈ (VC(Γ) × N)*, where VC(Γ) is the set of call nodes
Parser configurations: (PC × N × K)
Initial configuration: <<S•, 0>, |w|, [ ]>
Acceptance: final configuration is <<•S, 0>, 0, [ ]>
CALL1:  <<•B, j>, j, (<A→α•Bγ, i>)·K> ↦ <<A→α•Bγ, i>, j, K>
START1: <<B→•β, j>, j, K> ↦ <<•B, j>, j, K>
EXIT1:  <<B•, k>, j, K> ↦ <<B→β•, k>, j, K>  if <B→β•, k> ∈ Σj  (nondeterminism)
END1:   <<A→αB•γ, i>, j, K> ↦ <<B•, k>, j, (<A→α•Bγ, i>)·K>  if <B•, k> ∈ Σj and <A→α•Bγ, i> ∈ Σk  (nondeterminism)
SCAN1:  <<A→αt•γ, i>, (j+1), K> ↦ <<A→α•tγ, i>, j, K>
Fig. 4. NFA simulation, Earley recognizer, and Earley parser: input word is w
Figure 3(a) shows the GFG of Figure 1, but with simple node labels. Figure 3(b) illustrates the behavior of the NFA simulation algorithm for the input string "7+8+9". Each Σi is associated with a terminal string pair Ci = u•v, which indicates that prefix u of the input string w = uv has been read up to that point.
The behavior of this NFA ε-closure algorithm on a GFG is described concisely by the rules shown in Figure 4(a). Each rule is an inference rule or constraint; in some rules, the premises have multiple consequents. It is straightforward to use these rules to compute the smallest Σ-sets that satisfy all the constraints. The INIT rule enters •S into Σ0. Each of the other rules is associated with traversing a GFG edge from the node in its assumption to the node in its consequence. Thus, the CALL, START, END, and EXIT rules compute the ε-closure of a Σ-set; notice that the END rule is applied to all outgoing edges of end nodes.
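The rules of Figure 4(a) translate almost line for line into code. The sketch below uses the edge representation of the earlier construction sketch (labels of None stand for ε).

# A sketch of the NFA simulation of Figure 4(a): Sigma_0 is the epsilon-
# closure of {.S}; each input symbol maps the current set across matching
# scan edges and closes the result under epsilon edges.
def nfa_simulate(edges, start_node, end_node, word):
    def eps_closure(s):
        frontier = list(s)
        while frontier:
            v = frontier.pop()
            for src, label, dst in edges:
                if src == v and label is None and dst not in s:
                    s.add(dst); frontier.append(dst)
        return s

    sigma = eps_closure({start_node})                  # Sigma_0
    for t in word:                                     # build Sigma_1 .. Sigma_n
        step = {dst for src, label, dst in edges if src in sigma and label == t}
        sigma = eps_closure(step)
    return end_node in sigma                           # accept iff S. is in Sigma_n

# e.g. nfa_simulate(edges, ('start', 'S'), ('end', 'S'), ['int', '+', 'int'])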
3.2 Earley’s algorithm
Like the NFA ε-closure algorithm, Earley's algorithm builds Σ sets, but it computes reachability only along CR-paths starting at •S. Therefore, the main difference between the two algorithms is at end nodes: a CR-path that reaches an end node should be extended only to the return node corresponding to the last unmatched call node on that path.
One way to find this call node is to tag each start node with a unique ID (tag) when it is entered into a Σ-set, and propagate this tag through the nodes of productions for this non-terminal all the way to the end node. At the end node, this unique ID can be used to identify the Σ-set containing the corresponding start node. The last unmatched call node on the path must be contained in that set as well, and from that node, the return node to which the path should be extended can easily be determined.
To implement the tag, it is simple to use the number of the Σ-set to which the start node is added, as shown in Figure 4(b). When the CALL rule enters a start node into a Σ set, the tag assigned to this node is the number of that Σ set. The END rule is the only rule that actually uses tags; all other rules propagate tags. If <B•, k> ∈ Σj, then the matching start and call nodes are in Σk, so Σk is examined to determine which of the immediate predecessors of node •B occur in this set. These must be call nodes of the form A→α•Bγ, so the matching return nodes A→αB•γ are added to Σj with the tags of the corresponding call nodes. For a given grammar, this can easily be done in time constant with respect to the length of the input string. A string is in the language recognized by the GFG iff Σn contains <S•, 0>. Figure 3(c) shows the Σ sets computed by the Earley algorithm for the input string "7+8+9".
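These rules can be executed with a simple worklist per Σ-set. The sketch below (same hypothetical encoding as the earlier sketches) returns the Σ-sets together with the acceptance test; for brevity it omits the well-known completion subtlety for nullable non-terminals, so it assumes a grammar without ε-productions.

# A sketch of the Earley recognizer of Figure 4(b); each Sigma_j holds
# tagged GFG nodes <node, tag>. Assumes an epsilon-free grammar.
def earley_recognize(productions, start, word):
    prods = {}
    for a, rhs in productions:
        prods.setdefault(a, []).append(rhs)
    n = len(word)
    sigma = [set() for _ in range(n + 1)]
    sigma[0].add((('start', start), 0))                    # INIT: <.S, 0> in Sigma_0
    for j in range(n + 1):
        work = list(sigma[j])
        while work:
            node, i = work.pop()
            new = []
            if node[0] == 'start':                         # START: predict all productions
                new = [(('item', node[1], rhs, 0), i) for rhs in prods[node[1]]]
            elif node[0] == 'end':                         # END: consult Sigma_i for callers
                for n2, i2 in list(sigma[i]):
                    if n2[0] == 'item':
                        _, a, rhs, k = n2
                        if k < len(rhs) and rhs[k] == node[1]:
                            new.append((('item', a, rhs, k + 1), i2))
            else:
                _, a, rhs, k = node
                if k == len(rhs):                          # EXIT
                    new = [(('end', a), i)]
                elif rhs[k] in prods:                      # CALL: the tag is the set number j
                    new = [(('start', rhs[k]), j)]
                elif j < n and word[j] == rhs[k]:          # SCAN into Sigma_{j+1}
                    sigma[j + 1].add((('item', a, rhs, k + 1), i))
            for entry in new:
                if entry not in sigma[j]:
                    sigma[j].add(entry); work.append(entry)
    return sigma, (('end', start), 0) in sigma[n]          # accept iff <S., 0> in Sigma_|w|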
We discuss a small detail in using the rules of Figures 4(a,b) to construct Σ-sets for a given GFG and input word. The existence of a unique smallest sequence of Σ-sets can be proved in many ways, such as by observing that the rules have the diamond property and are strongly normalizing [19]. A canonical order of rule application for the NFA rules is the following. We give a unique number to each GFG edge, and associate the index <j, m> with a rule instance that corresponds to traversing edge m and adding the destination node to Σj; the scheduler always picks the rule instance with the smallest index. This order completes the Σ sets in sequence, but many other orders are possible. The same order can be used for the rules in Figure 4(b), except that for the END rule, we use the number on the edge (B•, A→αB•γ).
Correctness of the rules of Figure 4(b) follows from Theorem 2.
Theorem 2. For a grammar Γ = <N, T, P, S> and an input word w, <S•, 0> ∈ Σ|w| iff w is a word generated by grammar Γ.
Proof. See Section A.2.
The proof of Theorem 2 shows the following result, which is useful as a characterization of the contents of Σ sets. Let w[i..j] denote the substring of input w from position i to position j inclusive if i ≤ j, and let it denote ε if i > j. It is shown that <A→α•β, i> ∈ Σj iff there is a CR-path P: •S ↠ •A ↠ (A→α•β) such that
1. •S ↠ •A generates w[1..i], and
2. •A ↠ (A→α•β) is balanced and generates w[(i+1)..j].
Like the NFA algorithm, Earley's algorithm determines reachability along certain paths but does not represent paths explicitly. Both algorithms permit such implicitly maintained paths to share "sub-paths": in Figure 3(c), E• in Σ1 is reached by two CR-paths, Q1: (•S, p, •E, g, h, E•) and Q2: (•S, p, •E, i, •E, g, h, E•), and they share the sub-path (•E, g, h, E•). This path sharing permits Earley's algorithm to run in O(|w|³) time for any grammar (improved to O(|w|³/log|w|) by Graham et al. [7]), and O(|w|²) time for any unambiguous grammar, as we show in Theorem 3.
Theorem 3. For a given GFG G = (V, E) and input word w, Earley's algorithm requires O(|w|²) space and O(|w|³) time. If the grammar is unambiguous, the time complexity is reduced to O(|w|²).
Proof. See Section A.2
Earley parser The rules in Figure 4(b) define a recognizer. To get a parser, we need a way to enumerate a representation of the parse tree, such as a complete balanced GFG path, from the Σ sets; if the grammar is ambiguous, there may be multiple complete balanced paths that generate the input word.
Figure 4(c) shows a state transition system that constructs such a path in reverse; if there are multiple paths that generate the string, one of these paths is reconstructed non-deterministically. The parser starts with the entry <S•, 0> in the last Σ set, and reconstructs in reverse the inference chain that produced it from the entry <•S, 0> in Σ0; intuitively, it traverses the GFG in reverse from S• to •S, using the Σ set entries to guide the traversal. Like the NGA, it maintains a stack, but it pushes the matching call node when it traverses a return node, and pops the stack at a start node to determine how to continue the traversal.
The state of the parser is a three-tuple: a Σ set entry, the number of that Σ set, and the stack. The parser begins at <S•, 0> in Σn with an empty stack. It terminates when it reaches <•S, 0> in Σ0. The sequence of GFG nodes in the reverse path can be output during the execution of the transitions. It is easy to output other representations of parse trees if needed; for example, the parse tree can be produced in reverse post-order by outputting the terminal symbol or production name whenever a scan edge or exit node respectively is traversed in reverse by the parser.
To eliminate the need to look up Σ sets for the EXIT1 and END1 rules, the recognizer can save information relevant to the parser in a data structure associated with each Σ set. This data structure is a relation between the consequent and the premise(s) of each rule application; given a consequent, it returns the premise(s) that produced that consequent during recognition. If the grammar is ambiguous, there may be multiple premise(s) that produced a given consequent, and the data structure returns one of them non-deterministically. By enumerating these non-deterministic choices, it is possible to enumerate different parse trees for the given input string. Note that if the grammar is cyclic (that is, A ⇒+ A for some non-terminal A), there may be an infinite number of parse trees for some strings.
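To make the reverse traversal concrete, the following sketch consumes the productions dictionary prods and the Σ-sets sigma returned by the recognizer sketch above; it resolves each nondeterministic choice by taking the first valid premise, so for an ambiguous grammar it recovers one of the complete balanced paths.

# A sketch of the Earley parser of Figure 4(c): from <S., 0> in Sigma_n,
# run the recognizer's rules in reverse, pushing the matching call node
# at a return node and popping it at a start node.
def earley_parse(prods, sigma, start, n):
    node, i, j, stack, path = ('end', start), 0, n, [], []
    while (node, i, j) != (('start', start), 0, 0):
        path.append(node)
        if node[0] == 'end':                               # EXIT1: choose an exit node
            b = node[1]
            node = next(('item', b, rhs, len(rhs)) for rhs in prods[b]
                        if (('item', b, rhs, len(rhs)), i) in sigma[j])
        elif node[0] == 'start':                           # CALL1: pop the call node
            (node, i), stack = stack[-1], stack[:-1]
        else:
            _, a, rhs, k = node
            if k == 0:                                     # START1: back to .A
                node, i = ('start', a), j
            elif rhs[k - 1] in prods:                      # END1: push the call node
                b = rhs[k - 1]
                k2 = next(t for n2, t in sigma[j] if n2 == ('end', b)
                          and (('item', a, rhs, k - 1), i) in sigma[t])
                stack.append((('item', a, rhs, k - 1), i))
                node, i = ('end', b), k2
            else:                                          # SCAN1: consume in reverse
                node, j = ('item', a, rhs, k - 1), j - 1
    path.append(node)
    return path[::-1]      # a complete balanced path from .S to S.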
3.3 Discussion
In Earley's paper, the CALL and START rules were combined into a single rule called prediction, and the EXIT and END rules were combined into a single rule called completion [6]. Aycock and Horspool pre-compute some of the contents of Σ-sets to improve the running time in practice [2].
Erasing tags from the rules in Figure 4(b) for the Earley recognizer produces the rules for the NFA ε-closure algorithm in Figure 4(a). The only nontrivial erasure is for the END rule: k, the tag of the tuple <B•, k>, becomes undefined when tags are deleted, so the antecedent <A→α•Bγ, i> ∈ Σk for this rule is erased. Erasure of tags demonstrates lucidly the close and previously unknown connection between the NFA ε-closure algorithm and Earley's algorithm.
4 Preprocessing the GFG: look-ahead
Preprocessing the GFG is useful when many strings have to be parsed, since the investment in preprocessing time and space is amortized over the parsing of multiple strings. Look-ahead computation is a form of preprocessing that permits pruning of the set of paths that need to be explored for a given input string.
Given a CR-path Q: •S ↠ v which generates a string of terminals u, consider the set of all strings of k terminals that can be encountered along any CR extension of Q. When parsing a string uℓz with ℓ ∈ T^k, extensions of path Q can be safely ignored if ℓ does not belong to this set. We call this set the context-dependent look-ahead set at v for path Q, which we will write as CDLk(Q) (in the literature on program optimization, Q is called the calling context for its last node v). LL(k) and LR(k) parsers use context-dependent look-ahead sets.
We note that for pruning paths, it is safe to use any superset of CDLk(Q): larger supersets may be easier to compute off-line, possibly at the price of less pruning on-line. In this spirit, a widely used superset is FOLLOWk(v), associated with GFG node v, which we call the context-independent look-ahead set. It is the union of the sets CDLk(Q) over all CR-paths Q: •S ↠ v. Context-independent look-ahead is used by SLL(k) and SLR(k) parsers. It has also been used to enhance Earley's algorithm. Look-ahead sets intermediate between CDLk(Q) and FOLLOWk(v) have also been exploited, for example in LALR(k) and LALL(k) parsers [17].
The presentation of look-ahead computation algorithms is simplified if, at every stage of parsing, there is always a string ℓ of k symbols that has not yet been read. This can be accomplished by (i) padding the input string w with k $ symbols to form w$^k, where $ ∉ (T ∪ N), and (ii) replacing Γ = <N, T, P, S> with the augmented grammar Γ' = <N' = N ∪ {S'}, T' = T ∪ {$}, P' = P ∪ {S' → S$^k}, S'>.
Figure 5(a) shows an example using a stylized GFG, with node labels omitted for brevity. The set FOLLOW2(v) is shown in braces next to node v. If the word to be parsed is ybc, the parser can see that yb ∉ FOLLOW2(v) for v = S→y•Lab, so it can avoid exploration downstream of that node.
The influence of context is illustrated for node v = L→•a in Figure 5(a). Since the end node L• is reached before two terminal symbols are encountered, it is necessary to look beyond node L•, but the path relevant to look-ahead depends on the path that was taken to node L•. If the path taken was Q: •S' ↠ (S→y•Lab) ↠ (•L) ↠ (L→•a), then the relevant path for look-ahead is (L→•a) ↠ (S→yL•ab) ↠ S'•, so that CDL2(Q) = {aa}. If the path taken was R: •S' ↠ (S→y•Lbc) ↠ (•L) ↠ (L→•a), then the relevant path for look-ahead is (L→•a) ↠ (S→yL•bc) ↠ S'•, and CDL2(R) = {ab}.
We define these concepts formally next.
Definition 8. Context-dependent look-ahead: If v is a node in the GFG of an augmented grammar Γ' = <N', T', P', S'>, the context-dependent look-ahead CDLk(Q) for a CR-path Q: •S' ↠ v is the set of all k-prefixes of strings generated by paths Qs: v ↠ S'• where Q + Qs is a complete CR-path.
Definition 9. Context-independent look-ahead: If v is a node in the GFG for an augmented grammar Γ' = <N', T', P', S'>, FOLLOWk(v) is the set of all k-prefixes of strings generated by CR-paths v ↠ S'•.
As customary, we let FOLLOWk(A) and FOLLOW(A) respectively denote FOLLOWk(A•) and FOLLOW1(A•).
The rest of this section is devoted to the computation of look-ahead sets. It is convenient to introduce the function s1 +k s2 of strings s1 and s2, which returns their concatenation truncated to k symbols. In Definition 10, this operation is lifted to sets of strings.
Definition 10. Let T* denote the set of strings of symbols from alphabet T.
– For E ⊆ T*, (E)k is the set of k-prefixes of strings in E.
– For E, F ⊆ T*, E +k F = (E + F)k.
If E = {ε, t, tu, abc} and F = {ε, x, xy, xya}, then (E)2 = {ε, t, tu, ab} and (F)2 = {ε, x, xy}, and E +2 F = (E + F)2 = {ε, x, xy, t, tx, tu, ab}.
(a) FIRST2 and FOLLOW2 computation. The stylized GFG is for the grammar
S → yLab | yLbc | M    L → a | ε    M → MM | x
with the set FOLLOW2(v) shown next to each node v. The equations and their least solutions are:
FIRST2(S) = (y +2 FIRST2(L) +2 {ab}) ∪ (y +2 FIRST2(L) +2 {bc}) ∪ FIRST2(M)
FIRST2(L) = {a} ∪ {ε}
FIRST2(M) = {x} ∪ (FIRST2(M) +2 FIRST2(M))
Solution: FIRST2(S) = {ya, yb, x, xx}, FIRST2(M) = {x, xx}, FIRST2(L) = {a, ε}
FOLLOW2(S) = {$$}
FOLLOW2(L) = {ab} ∪ {bc}
FOLLOW2(M) = FOLLOW2(S) ∪ (FIRST2(M) +2 FOLLOW2(M)) ∪ FOLLOW2(M)
Solution: FOLLOW2(S) = {$$}, FOLLOW2(L) = {ab, bc}, FOLLOW2(M) = {$$, xx, x$}
(b) Partial and full 2-look-ahead cloning.
Fig. 5. Look-ahead computation example
Lemma 1(a) says that concatenation followed by truncation is equivalent to "pre-truncation" followed by concatenation and truncation; this permits look-ahead computation algorithms to work with strings of length at most k throughout the computation, rather than with strings of unbounded length truncated to k only at the end.
Lemma 1. The function +k has the following properties.
(a) E +k F = (E)k +k (F)k.
(b) +k is associative and distributes over set union.
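These definitions are easy to check mechanically; in the short Python sketch below (strings stand for terminal strings, '' for ε), the final assertion checks Lemma 1(a) on the example above.

# A sketch of truncated concatenation (Definition 10).
def trunc(E, k):                       # (E)_k: the k-prefixes of strings in E
    return {s[:k] for s in E}

def concat_k(E, F, k):                 # E +_k F = (E + F)_k
    return trunc({e + f for e in E for f in F}, k)

E, F = {'', 't', 'tu', 'abc'}, {'', 'x', 'xy', 'xya'}
assert trunc(E, 2) == {'', 't', 'tu', 'ab'}
assert concat_k(E, F, 2) == {'', 'x', 'xy', 't', 'tx', 'tu', 'ab'}
assert concat_k(E, F, 2) == concat_k(trunc(E, 2), trunc(F, 2), 2)   # Lemma 1(a)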
4.1 Context-independent look-ahead
FOLLOWk(v) can be computed by exploring CR-paths from v to S'•. However, for the "bulk" problem of computing these sets for many GFG nodes, such as all entry nodes in a GFG, coordination of path explorations at different nodes can yield greater efficiency.
Although we do not use this approach directly, the GFG permits FOLLOWk computation to be viewed as an inter-procedural backward dataflow analysis problem [14]. Dataflow facts are possible FOLLOWk sets, which are the subsets of T^k, and which form a finite lattice under subset ordering (the empty set is the least element). For an edge e with label t, the dataflow transfer function is Fe(X) = {t} +k X (for ε edges, this reduces to the identity function, as expected). For a path Q with edges labeled t1, ..., tn, the composite transfer function is {t1} +k ({t2} +k ... ({tn} +k X)), which can be written as ({t1} +k {t2} +k ... {tn}) +k X because +k is associative. If we denote the k-prefix of the terminal string generated by Q by FIRSTk(Q), the composite transfer function for a path Q is FIRSTk(Q) +k X. The confluence operator is set union. To ensure that dataflow information is propagated only along (reverse) CR-paths, it is necessary to find inter-procedural summary functions that permit look-ahead sets to be propagated directly from a return node to its matching call node. These summary functions are hard to compute for most dataflow problems, but this is easy for FOLLOWk computation because the lattice L is finite, the transfer functions distribute over set union, and the +k operation is associative. For a non-terminal A, the summary function is FA(X) = FIRSTk(A) +k X, where FIRSTk(A) is the set of k-prefixes of terminal strings generated by balanced paths from •A to A•. The FIRSTk relation can be computed efficiently as described below. This permits the use of the functional approach to inter-procedural dataflow analysis [14] to solve the FOLLOWk computation problem (the development below does not rely on any results from this framework).
FIRSTk computation For Γ = <N, T, P, S>, FIRSTk(A) for A ∈ N is defined canonically as the set of k-prefixes of terminal strings derived from A [17]. This is equivalent to the following, as we show in Theorem 4.
Definition 11. Given a grammar Γ = <N, T, P, S>, a positive integer k and A ∈ N, FIRSTk(A) is the set of k-prefixes of terminal strings generated by balanced paths from •A to A•.
Following convention, we write FIRST(A) to mean FIRST1(A).
Definition 12. FIRSTk is extended to a string u1u2...un ∈ (N ∪ T)* as follows.
FIRSTk(ε) = {ε}
FIRSTk(t ∈ T) = {t}
FIRSTk(u1u2...un) = FIRSTk(u1) +k ... +k FIRSTk(un)
FIRSTk sets for non-terminals can be computed as the least solution of a system of equations derived from the grammar.
Algorithm 1 For Γ = <N, T, P, S> and positive integer k, let M be the finite lattice whose elements are sets of terminal strings of length at most k, ordered by containment with the empty set being the least element. The FIRSTk sets for the non-terminals are given by the least solution in M of this equational system:
∀A ∈ N:  FIRSTk(A) = ∪ { FIRSTk(α) : A → α ∈ P }
Figure 5(a) shows an example.
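Algorithm 1 is a straightforward iterate-to-fixpoint computation. The sketch below reuses trunc/concat_k from the previous sketch and reproduces the FIRST2 sets of Figure 5(a); the grammar encoding is the same hypothetical one used earlier.

# A sketch of Algorithm 1: the least fixpoint of the FIRST_k equations.
def first_k(nonterminals, productions, k):
    first = {a: set() for a in nonterminals}

    def first_of(symbols):               # FIRST_k of a string (Definition 12)
        out = {''}
        for x in symbols:
            out = concat_k(out, first[x] if x in first else {x}, k)
        return out

    changed = True
    while changed:                       # iterate until the least solution
        changed = False
        for a, rhs in productions:
            new = first_of(rhs) - first[a]
            if new:
                first[a] |= new; changed = True
    return first

# Figure 5(a): S -> yLab | yLbc | M, L -> a | eps, M -> MM | x.
g = [('S', ('y', 'L', 'a', 'b')), ('S', ('y', 'L', 'b', 'c')), ('S', ('M',)),
     ('L', ('a',)), ('L', ()), ('M', ('M', 'M')), ('M', ('x',))]
assert first_k({'S', 'L', 'M'}, g, 2) == \
    {'S': {'ya', 'yb', 'x', 'xx'}, 'L': {'a', ''}, 'M': {'x', 'xx'}}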
FOLLOWk computation
Algorithm 2 Given an augmented grammar Γ' = <N', T', P', S'> and positive integer k, let L be the lattice whose elements are sets of terminal strings of length k, ordered by containment with the empty set being the least element. The FOLLOWk sets for non-terminals other than S' are given by the least solution of this equational system:
FOLLOWk(S) = {$^k}
∀B ∈ N' − {S, S'}:  FOLLOWk(B) = ∪ { FIRSTk(γ) +k FOLLOWk(A) : A → αBγ ∈ P' }
Given FOLLOWk sets for non-terminals, FOLLOWk sets at all GFG nodes are computed by interpolation:
FOLLOWk(A→α•β) = FIRSTk(β) +k FOLLOWk(A).
Figure 5(a) shows an example. M occurs in three places on the righthand sides of the grammar productions, so the righthand side of the equation for FOLLOWk(M) is the union of three sets: the first from S → M, the second from the first occurrence of M in M → MM, and the third from the second occurrence of M in M → MM.
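The FOLLOWk equations are solved the same way; the sketch below builds on the first_k and concat_k sketches above and reproduces the FOLLOW2 sets of Figure 5(a).

# A sketch of Algorithm 2: FOLLOW_k as a least fixpoint over the grammar.
def follow_k(nonterminals, productions, start, k, first):
    def first_of(symbols):
        out = {''}
        for x in symbols:
            out = concat_k(out, first[x] if x in first else {x}, k)
        return out

    follow = {a: set() for a in nonterminals}
    follow[start] = {'$' * k}                 # FOLLOW_k(S) = {$^k}
    changed = True
    while changed:
        changed = False
        for a, rhs in productions:
            for i, b in enumerate(rhs):
                if b in nonterminals:         # FIRST_k(gamma) +_k FOLLOW_k(A)
                    new = concat_k(first_of(rhs[i + 1:]), follow[a], k) - follow[b]
                    if new:
                        follow[b] |= new; changed = True
    return follow

fst = first_k({'S', 'L', 'M'}, g, 2)          # from the previous sketch
assert follow_k({'S', 'L', 'M'}, g, 'S', 2, fst) == \
    {'S': {'$$'}, 'L': {'ab', 'bc'}, 'M': {'$$', 'xx', 'x$'}}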
Using context-independent look-ahead in the Earley parser Some implementations of Earley's parser use a context-independent look-ahead of one symbol at start nodes and end nodes (this is called prediction look-ahead and completion look-ahead respectively) [6]. The practical benefit of using look-ahead in the Earley parser has been debated in the literature. The implementation of Graham et al. does not use look-ahead [7]; other studies argue that some benefits accrue from using prediction look-ahead [8]. Prediction look-ahead is implemented by modifying the START rule in Figure 4(b): the production B → β is explored only if β might produce the empty string or a string that starts with the first look-ahead symbol. For this, the following formula is added to the antecedents of the START rule: (ε ∈ FIRST(β)) ∨ (Cj = u•tv ∧ t ∈ FIRST(β)).
Completion look-ahead requires adding the following check to the antecedents of the END rule in Figure 4(b): (Cj = u•tv) ∧ (t ∈ FIRST(γ) ∨ (ε ∈ FIRST(γ) ∧ t ∈ FOLLOW(A))).
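The two checks are simple membership tests. In the sketch below, first_beta, first_gamma and follow_a stand for the precomputed sets FIRST(β), FIRST(γ) and FOLLOW(A), '' stands for ε, and t is the next unread symbol.

# Sketches of the look-ahead guards added to the START and END rules.
def start_guard(first_beta, t):
    # explore B -> beta only if beta may be empty or may start with t
    return '' in first_beta or t in first_beta

def end_guard(first_gamma, follow_a, t):
    # complete only if t can continue gamma, or gamma may be empty
    # and t can follow A
    return t in first_gamma or ('' in first_gamma and t in follow_a)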
4.2 Context-dependent look-ahead
LL(k) and LR(k) parsers use context-dependent k-look-ahead. As one would expect, exploiting context enables a parser to rule out more paths than if it uses context-independent look-ahead. One way to implement context-dependent look-ahead for a grammar Γ is to reduce it to the problem of computing context-independent look-ahead for a related grammar Γc through an operation similar to procedure cloning.
In general, cloning a non-terminal A in a grammar Γ creates a new grammar in which (i) non-terminal A is replaced by some number of new non-terminals A1, A2, ..., Ac (c ≥ 2) with the same productions as A, and (ii) all occurrences of A in the righthand sides of productions are replaced by some Aj (1 ≤ j ≤ c). Figure 5(b) shows the result of cloning non-terminal L in the grammar of Figure 5(a) into two new non-terminals L1, L2. Cloning obviously does not change the language recognized by the grammar.
The intuitive idea behind the use of cloning to implement context-dependent look-ahead is to create a cloned grammar that has a copy of each production in Γ for each context in which that production may be invoked, so as to "de-alias" look-ahead sets. In general, it is infeasible to clone a non-terminal for every one of its calling contexts, which can be infinitely many. Fortunately, contexts with the same look-ahead set can be represented by the same clone. Therefore, the number of necessary clones is bounded by the number of possible k-look-ahead sets for a node, which is 2^(|T|^k). Since this number grows rapidly with k, cloning is practical only for small values of k, but the principle is clear.
Algorithm 3 Given an augmented grammar Γ' = (N', T', P', S') and a positive integer k, Tk(Γ') is the following grammar:
– Nonterminals: {S'} ∪ {[A, R] | A ∈ (N' − S'), R ⊆ T'^k}
– Terminals: T'
– Start symbol: S'
– Productions:
  • S' → α, where S' → α ∈ Γ',
  • all productions [A, R] → Y1Y2...Ym where, for some A → X1X2...Xm ∈ P',
    Yi = Xi if Xi is a terminal, and
    Yi = [Xi, FIRSTk(Xi+1...Xm) +k R] otherwise.
Therefore, to convert the context-dependent look-ahead problem to the context-independent problem, cloning is performed as follows. For a given k, each non-terminal A in the original grammar is replaced by a set of non-terminals [A, R] for every R ⊆ T'^k (intuitively, R will end up being the context-independent look-ahead of [A, R] in the cloned grammar). The look-ahead R is then interpolated into each production of A to determine the new productions, as shown in Algorithm 3. Figure 5(b) shows the result of full 2-look-ahead cloning of the grammar in Figure 5(a) after useless non-terminals have been removed.
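A sketch of Algorithm 3, on top of the first_k and concat_k sketches above: rather than materializing a clone [A, R] for every R ⊆ T'^k, it generates clones on demand starting from the start symbol, which amounts to constructing Tk(Γ') with useless non-terminals already removed.

# A sketch of Algorithm 3 (cloning for context-dependent look-ahead).
# A clone [A, R] is encoded as the pair (A, frozenset(R)).
def clone_grammar(nonterminals, productions, start, k, first):
    def first_of(symbols):
        out = {''}
        for x in symbols:
            out = concat_k(out, first[x] if x in first else {x}, k)
        return out

    new_prods, todo, done = [], [(start, frozenset({'$' * k}))], set()
    while todo:
        a, r = todo.pop()
        if (a, r) in done:
            continue
        done.add((a, r))
        for a2, rhs in productions:
            if a2 != a:
                continue
            new_rhs = []
            for i, x in enumerate(rhs):
                if x in nonterminals:    # Y_i = [X_i, FIRST_k(X_{i+1}...X_m) +_k R]
                    r2 = frozenset(concat_k(first_of(rhs[i + 1:]), r, k))
                    new_rhs.append((x, r2)); todo.append((x, r2))
                else:
                    new_rhs.append(x)    # terminals are copied unchanged
            new_prods.append(((a, r), tuple(new_rhs)))
    return new_prods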
5 Related work
The connection between context-free grammars and procedure call/return in programming languages was made in the early 1960's when the first recursive-descent parsers were developed. The approach taken in this paper is to formulate parsing problems as path problems in the GFG, and the procedure call/return mechanism is used only to build intuition.
In 1970, Woods defined a generalization of finite-state automata called recursive transition networks (RTNs) [20]. Perlin defines an RTN as "...a forest of disconnected transition networks, each identified by a nonterminal label. All other labels are terminal labels. When, in traversing a transition network, a nonterminal label is encountered, control recursively passes to the beginning of the correspondingly labeled transition network. Should this labeled network be successfully traversed, on exit, control returns back to the labeled calling node" [12]. The RTN was the first graphical representation of context-free grammars, and all subsequent graphical representations, including the GFG, are variations on this theme. Notation similar to GFG start and end nodes was first introduced by Graham et al. in their study of the Earley parser [7]. The RTN with this extension is used in the ANTLR system for LL(*) grammars [10].
The key difference between RTNs and GFGs is in the interpretation of the graphical representation. An interpretation based on a single locus of control that flows between productions is adequate for SLL(k)/LL(k)/LL(*) languages but inadequate for handling more general grammars, for which multiple paths through the GFG must be followed, so some notion of multiple threads of control needs to be added to the basic interpretation of the RTN. For example, Perlin models LR grammars using a chart parsing strategy in which portions of the transition network are copied dynamically [12]. In contrast, the GFG is a single graph, and all parsing problems are formulated as path problems in this graph; there is no operational notion of a locus of control that is transferred between productions. In particular, the similarity between Earley's algorithm and the NFA simulation algorithm emerges only if parsing problems are framed as path problems in a single graph. We note that the importance of the distinction between the two viewpoints was highlighted by Sharir and Pnueli in their seminal work on inter-procedural dataflow analysis [14].
The logic programming community has explored the notion of "parsing as deduction" [11,15,16], in which the rules of the Earley recognizer in Figure 4(b) are considered to be inference rules derived from a grammar, and recognition is viewed as the construction of a proof that a given string is in the language generated by that grammar. The GFG shows that this proof construction can be interpreted as constructing complete balanced paths in a graphical representation of the grammar.
An important connection between inter-procedural dataflow analysis and reachability computation was made by Yannakakis [21], who introduced the notion of CFL-paths. Given a graph with labeled edges and a context-free grammar, CFL-paths are paths that generate strings recognized by the given context-free grammar. Therefore, the context-free grammar is external to the graph, whereas the GFG is a direct representation of a context-free grammar with labeled nodes (start and end nodes must be known) and labeled edges. If node labels are erased from a GFG and CFL-paths for the given grammar are computed, this set of paths will include all the complete balanced paths but, in general, it will also include non-CR-paths that happen to generate strings in the language recognized by the context-free grammar.
6 Conclusions
In other work, we have shown that the GFG permits an elementary presentation of LL, SLL, LR, SLR, and LALR grammars in terms of GFG paths. These results and the results in this paper suggest that the GFG can be a new foundation for the study of context-free grammars.
Acknowledgments: We would like to thank Laura Kallmeyer for pointing us to the literature on parsing in the logic programming community, and Giorgio Satta and Lillian Lee for useful discussions about parsing.
References
1. A. Aho, M. Lam, R. Sethi, and J. Ullman. Compilers: principles, techniques, and tools. Addison Wesley, 2007.
2. J. Aycock and N. Horspool. Practical Earley parsing. The Computer Journal, 45(6):620–630,
2002.
3. W. D. Clinger and C. Halpern. Alternative semantics for McCarthy's amb. In Seminar on Concurrency, Carnegie-Mellon University, pages 467–478, London, UK, 1985. Springer-Verlag.
4. T. Cormen, C. Leiserson, R. Rivest, and C. Stein, editors. Introduction to Algorithms. MIT
Press, 2001.
5. P. Cousot and R. Cousot. Parsing as abstract interpretation of grammar semantics. Theoret.
Comput. Sci., 290:531–544, 2003.
6. J. Earley. An efficient context-free parsing algorithm. Commun. ACM, 13(2):94–102, Feb.
1970.
7. S. L. Graham, W. L. Ruzzo, and M. Harrison. An improved context-free recognizer. ACM
TOPLAS, 2(3):415–462, July 1980.
8. D. Grune and C. Jacobs. Parsing Techniques: a practical guide. Springer-Verlag, 2010.
9. J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages,
and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston,
MA, USA, 2006.
10. T. Parr and K. Fisher. LL(*): the foundation of the ANTLR parser generator. In PLDI, 2011.
11. F. C. N. Pereira and D. Warren. Parsing as deduction. In 21st Annual Meeting of the Association for Computational Linguistics, pages 137–144, MIT, Cambridge, Massachusetts, 1983.
12. M. Perlin. LR recursive transition networks for Earley and Tomita parsing. In ACL ’91,
1991.
13. J. C. Reynolds. On the relation between direct and continuation semantics. In Proceedings of
the 2nd Colloquium on Automata, Languages and Programming, pages 141–156. Springer-
Verlag, 1974.
14. M. Sharir and A. Pnueli. Program flow analysis: theory and applications, chapter Two
approaches to interprocedural dataflow analysis, pages 189–234. Prentice-Hall, 1981.
15. S. M. Shieber, Y. Schabes, and F. C. N. Pereira. Principles and implementation of deductive
parsing. Journal of Logic Programming, 24(1 and 2):3–36, 1995.
16. K. Sikkel. Parsing Schemata. Texts in Theoretical Computer Science. Springer-Verlag,
Berlin, Heidelberg, New York, 1997.
17. S. Sippu and E. Soisalon-Soininen. Parsing theory. Springer-Verlag, 1988.
18. G. Sussman and G. Steele. Scheme: An interpreter for extended lambda calculus. Technical
Report AI Memo 349, AI Lab, M.I.T., December 1975.
19. Terese. Term Rewriting Systems. Cambridge University Press, 2003.
20. W. A. Woods. Transition network grammars for natural language analysis. Commun. ACM,
13(10), Oct. 1970.
21. M. Yannakakis. Graph-theoretic methods in database theory. In Principles of Database
Systems, 1990.
A Appendix
A.1 Derivations, parse trees and GFG paths
The following result connects complete balanced paths to parse trees.
Theorem 4. Let Γ = <N, T, P, S> be a context-free grammar and G = GFG(Γ) the corresponding grammar flow graph. Let A ∈ N. There exists a balanced path from •A to A• with n_cr call-return pairs that generates a string w ∈ T* if and only if there exists a parse tree for w with n_int = n_cr + 1 internal nodes.
Proof. We proceed by induction on n_cr. The base case, n_cr = 0, arises for a production A → u1u2...ur where each uj is a terminal. The GFG balanced path contains the sequence of nodes
•A, A→•u1u2...ur, ..., A→u1u2...ur•, A•
The corresponding parse tree has a root with label A and r children respectively labeled u1, u2, ..., ur (from left to right), with n_int = 1 internal node. The string generated by the path and derived from the tree is w = u1u2...ur.
Assume now inductively the stated property for paths with fewer than n_cr call-return pairs and trees with fewer than n_int internal nodes. Let Q be a path from •A to A• with n_cr call-return pairs. Let A → u1u2...ur be the "top production" used by Q, i.e., the second node on the path is A→•u1u2...ur. If uj ∈ N, then Q will contain a segment of the form
A→u1...uj−1•uj...ur, Qj, A→u1...uj•uj+1...ur
where Qj is a balanced path from •uj to uj•, generating some word wj. Let Tj be a parse tree for wj with root labeled uj, whose existence follows from the inductive hypothesis. If instead uj ∈ T, then Q will contain the scan edge
(A→u1...uj−1•uj...ur, A→u1...uj•uj+1...ur)
generating the word wj = uj. Let Tj be a tree with a single node labeled wj = uj. The word generated by Q is w = w1w2...wr. Clearly, the tree T with a root labeled A and r subtrees equal (from left to right) to T1, T2, ..., Tr derives string w. Finally, it is simple to show that T has n_int = n_cr + 1 internal nodes.
The construction of a balanced path generating w from a tree deriving w follows the same structure.
A.2 Correctness and complexity of Earley’s algorithm
The following result is an "inductive version" of Theorem 2, which asserts the correctness of the rules for the Earley parser.
Theorem 5. Consider the execution of Earley's algorithm on input string w = a1a2...an. Let z be a GFG node and i and j be integers such that 0 ≤ i ≤ j ≤ n. The following two properties are equivalent.
(A) The algorithm creates an entry <z, i> in Σj.
(B) There is a CR-path Q = (•S)Q'z (represented as a sequence of GFG nodes beginning at •S and ending at z) that generates a1a2...aj and whose prefix preceding the last unmatched call edge generates a1a2...ai.
Proof. Intuitively, the key fact is that each rule of Earley's algorithm (aside from initialization) uses an entry <y, i'> ∈ Σj' and a GFG edge (y, z) to create an entry <z, i> ∈ Σj, where the dependence of i and j upon i' and j' depends on the type of edge (y, z). For a return edge, a suitable entry <z', k> ∈ Σi' is also consulted. In essence, if a CR-path can be extended by an edge, then (and only then) the appropriate rule creates the entry for the extended path. The formal proof is an inductive formulation of this intuition and carries out a case analysis with respect to the type of edge that extends the path.
Part I. BA(from CR-path to Earley entry). The argument proceeds by induction
on the length (number of edges) `of path Q.
- Base cases (`= 0,1).
The only path with no edges is Q= (S), for which i=j= 0. The INIT rule produces
the corresponding entry <S, 0>Σ0. The paths with just one edge are also easily
dealt with, as they are of the form Q= (S)(Sσ), that is, they contain one ENTRY
edge.
- Inductive step (from `11to `).
Consider a CR-path Q= (S)Ryz of length `. It is straightforward to check that Q0=
(S)Ry is also a CR-path, of length `1. Hence, by the inductive hypothesis, an entry
<y, i0>is created by the algorithm in some Σj0, with Q0generating a1a2. . . aj0and
with the prefix of Q0preceding its last unmatched call edge generating a1a2. . . ai0.
Inspection of the rules for the Earley parser in Figure 4 reveals that, given <y, i0>
Σj0and given the presence edge (y, z)in the CR-path Q, an entry <z, i> Σjis
always created by the algorithm. It remains to show that iand jhave, with respect to
path Q, the relationship stated in property (B).
A Graphical Model for Context-free Grammar Parsing 23
- Frame number $j$. We observe that the string of terminals generated by $Q$ is the same as the string generated by $Q'$, except when $(y, z)$ is a scan edge, in which case $Q$ generates $a_1 a_2 \ldots a_{j'+1}$. Correspondingly, the algorithm sets $j = j'$, except when $(y, z)$ is a scan edge, in which case it sets $j = j' + 1$.
- Tag $i$. We distinguish three cases, based on the type of edge.
When $(y, z)$ is an entry, scan, or exit edge, $Q$ has the same last unmatched call edge as $Q'$. Correspondingly, $i = i'$.
When $(y, z)$ is a call edge, then $(y, z)$ is the last unmatched call edge on $Q$. The algorithm correctly sets $i = j' = j$.
Finally, let $(y, z)$ be a return edge, with $y = B\bullet$ and $z = A \to \alpha B \bullet \gamma$. Since $Q$ is a CR-path, $(y, z)$ must match the last unmatched call edge in $Q'$, say $(z', y')$, with $z' = A \to \alpha \bullet B\gamma$ and $y' = \bullet B$. We can then write $Q = (\bullet S)Q_1 z' y' Q_2 y z$ where $Q_2$ is balanced, whence $Q$ and $(\bullet S)Q_1 z'$ have the same last unmatched call edge, say $(u, v)$. The prefix of $Q$ ending at $z'$ coincides with the prefix of $Q'$ preceding its last unmatched call edge, so it generates $a_1 a_2 \ldots a_{i'}$; let $k \le i'$ be such that the prefix of $Q$ ending at $u$ generates $a_1 a_2 \ldots a_k$. By the inductive hypothesis, corresponding to path $(\bullet S)Q_1 z'$, the algorithm will have created entry $\langle z' = A \to \alpha \bullet B\gamma, k \rangle \in \Sigma_{i'}$. From entries $\langle y = B\bullet, i' \rangle \in \Sigma_{j'}$ and $\langle z' = A \to \alpha \bullet B\gamma, k \rangle \in \Sigma_{i'}$, as well as from return edge $(y, z)$, the END rule of the algorithm, as written in Figure 4, creates $\langle z = A \to \alpha B \bullet \gamma, i = k \rangle \in \Sigma_j$; this is as required by (B), since the prefix of $Q$ preceding its last unmatched call edge $(u, v)$ generates $a_1 a_2 \ldots a_k$.
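The case analysis of this inductive step can be summarized in a few lines. The sketch below is our summary (the function name and edge-type strings are assumptions): it records how the new entry's tag and frame number are derived from those of $\langle y, i' \rangle \in \Sigma_{j'}$, with $k$ denoting the tag of the matching call-site entry in the return case.

```python
def new_entry_indices(edge_type, i_prev, j_prev, k=None):
    """(i, j) of the created entry <z, i> in Sigma_j, given the entry
    <y, i'> in Sigma_j' and the type of GFG edge (y, z).  'k' is the tag
    of the call-site entry <z', k> consulted for a return edge."""
    if edge_type in ("entry", "exit"):
        return i_prev, j_prev        # same word, same last unmatched call
    if edge_type == "scan":
        return i_prev, j_prev + 1    # one more terminal is generated
    if edge_type == "call":
        return j_prev, j_prev        # (y, z) is now the last unmatched call
    if edge_type == "return":
        return k, j_prev             # revert to the call site's tag
    raise ValueError("unknown edge type: " + edge_type)
```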
Part II. (A) $\Rightarrow$ (B) (from Earley entry to CR-path). The argument proceeds by induction on the number $q$ of rule applications executed by the algorithm when entry $\langle z, i \rangle$ is first added to $\Sigma_j$. (Further "discoveries" that $\langle z, i \rangle \in \Sigma_j$ are possible, but the entry is added only once.)
- Base case ($q = 1$). The only rule applicable at first is INIT, creating the entry $\langle \bullet S, 0 \rangle \in \Sigma_0$, whose corresponding path is clearly $Q = (\bullet S)$.
- Inductive step (from $q - 1 \ge 1$ to $q$). Let the $q$-th rule application of the algorithm be based on GFG edge $(y, z)$ and on entry $\langle y, i' \rangle \in \Sigma_{j'}$. Also let $\langle z, i \rangle \in \Sigma_j$ be the entry created by the algorithm as a result of said rule application. By the inductive hypothesis, there is a CR-path $(\bullet S)Q'y$ generating $a_1 a_2 \ldots a_{j'}$ and with the prefix of $Q'$ preceding its last unmatched call edge generating $a_1 a_2 \ldots a_{i'}$. To show that to entry $\langle z, i \rangle \in \Sigma_j$ there corresponds a CR-path $Q$ as in (B), we consider two cases, based on the type of edge $(y, z)$.
When $(y, z)$ is an entry, scan, exit, or call edge, we consider the path $Q = (\bullet S)Q'yz$. Arguments symmetric to those employed in Part I of the proof show that the path $Q$ does satisfy property (B), with exactly the values $i$ and $j$ of the entry $\langle z, i \rangle \in \Sigma_j$ produced by the algorithm.
When $(y, z)$ is a return edge, the identification of path $Q$ requires more care. Let $y = B\bullet$ and $z = A \to \alpha B \bullet \gamma$. The END rule of Earley's algorithm creates entry $\langle z, i \rangle \in \Sigma_j$ based on two previously created entries, to each of which, by the inductive hypothesis, there corresponds a path, as discussed next.
To entry $\langle y = B\bullet, k \rangle \in \Sigma_j$, there corresponds a CR-path of the form $Q' = (\bullet S)Q'_1 x' y' Q_2 y$, with last unmatched call edge $(x', y')$, where $y' = \bullet B$ and $Q_2$ is balanced.
To entry $\langle z' = A \to \alpha \bullet B\gamma, i \rangle \in \Sigma_k$ there corresponds a CR-path of the form $Q'' = (\bullet S)Q_1 z'$, where $z' = A \to \alpha \bullet B\gamma$.
From the above two paths, as well as from return edge $(y, z)$, we can form a third CR-path $Q = (\bullet S)Q_1 z' y' Q_2 y z$. We observe that it is legitimate to concatenate $(\bullet S)Q_1 z'$ with $y' Q_2 y$ via the call edge $(z', y')$, since $y' Q_2 y$ is balanced. It is also legitimate to append return edge $(y, z)$ to $(\bullet S)Q_1 z' y' Q_2 y$ (thus obtaining $Q$), since this edge does match $(z', y')$, the last unmatched call edge of said path.
It is finally straightforward to check that the frame number $j$ and the tag $i$ are appropriate for $Q$.
Proof of Theorem 3. For a given GFG $G = (V, E)$ and input word $w$, Earley's algorithm requires $O(|w|^2)$ space and $O(|w|^3)$ time. If the grammar is unambiguous, the time complexity is reduced to $O(|w|^2)$.
Proof.
- Space complexity: There are $|w| + 1$ $\Sigma$-sets, and each $\Sigma$-set can have at most $|V| \cdot (|w| + 1)$ elements, since there are $|w| + 1$ possible tags. Therefore, the space complexity of the algorithm is $O(|w|^2)$.
- Time complexity: For the time complexity, we need to estimate the number of distinct rule instances that can be invoked and the time to execute each one (intuitively, the number of times each rule can "fire" and the cost of each firing).
For the time to execute each rule instance, we note that the only non-trivial rule is the end rule: when $\langle B\bullet, k \rangle$ is added to $\Sigma_j$, we must look up $\Sigma_k$ to find entries of the form $\langle A \to \alpha \bullet B\gamma, i \rangle$. To permit this search to be done in constant time per entry, we maintain a data structure with each $\Sigma$-set, indexed by non-terminal, which returns the list of such entries for that non-terminal. Therefore, all rule instances can be executed in constant time per instance.
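A plausible realization of this index (a sketch under assumed encodings, not the authors' implementation) buckets each call-site entry $\langle A \to \alpha \bullet B\gamma, i \rangle$ under $B$ as it is inserted:

```python
from collections import defaultdict

class SigmaSet:
    """One Earley Sigma-set with the per-nonterminal index described
    above.  Entries are (node, tag) pairs.  nonterminal_after_dot is an
    assumed helper: it returns B if node encodes A -> alpha . B gamma,
    and None otherwise."""
    def __init__(self, nonterminal_after_dot):
        self.entries = set()               # all entries <z, i> in this set
        self.awaiting = defaultdict(list)  # B -> entries <A -> alpha . B gamma, i>
        self.nt_after_dot = nonterminal_after_dot

    def add(self, node, tag):
        if (node, tag) in self.entries:
            return False                   # each entry is added only once
        self.entries.add((node, tag))
        B = self.nt_after_dot(node)
        if B is not None:
            self.awaiting[B].append((node, tag))
        return True
```

When $\langle B\bullet, k \rangle$ is later added to $\Sigma_j$, the end rule iterates over the awaiting[B] bucket of $\Sigma_k$, spending constant time per retrieved entry.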
We now compute an upper bound on the number of distinct rule instances for each rule schema. The init rule schema has only one instance. The start rule schema has two parameters: the particular start node in the GFG at which this rule schema is being applied and the tag $j$; it can be applied for each outgoing edge of that start node, so the number of instances of this rule is $O(|V| \cdot |V| \cdot |w|)$; for a given GFG, this is $O(|w|)$.
Similarly, the end rule schema has four parameters: the particular end node in the GFG, and the values of $i$, $j$, $k$; the relevant return node is determined by these parameters. Therefore, an upper bound on the number of instances of this schema is $O(|V| \cdot |w|^3)$, which is $O(|w|^3)$ for a given GFG.
A similar argument shows that the complexity of the call, exit, and scan rule schema instances is $O(|w|^2)$.
Therefore the complexity of the overall algorithm is $O(|w|^3)$.
- Unambiguous grammar: As shown above, the cubic complexity of Earley's algorithm arises from the end rule. Consider the consequent of the end rule. The proof of Theorem 2 shows that $\langle A \to \alpha B \bullet \gamma, i \rangle \in \Sigma_j$ iff $w[i..(j-1)]$ can be derived from $\alpha B$. If the grammar is unambiguous, there can be only one such derivation; considering the antecedents of the end rule, this means that for a given return node $A \to \alpha B \bullet \gamma$ and given values of $i$ and $j$, there can be exactly one $k$ for which the antecedents of the end rule are true. Therefore, for an unambiguous grammar, the end rule schema can be instantiated at most $O(|w|^2)$ times for a given grammar. Since all other rules are bounded above similarly, we conclude that Earley's algorithm runs in time $O(|w|^2)$ for an unambiguous grammar.
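To make the rule schemas and this cost accounting concrete, here is a minimal sketch of a GFG-driven Earley recognizer (our illustration under assumed node encodings, not the authors' code). For brevity it scans $\Sigma_k$ linearly in the end rule rather than using the per-nonterminal index discussed above, and it glosses over the well-known completion subtlety for nullable nonterminals.

```python
def earley_recognize(grammar, start, w):
    """grammar: dict mapping each nonterminal to a list of right-hand
    sides (tuples of symbols); terminals are symbols not in grammar.
    GFG nodes are encoded as ("start", A) for .A, ("end", A) for A.,
    and ("item", A, rhs, d) for A -> rhs[:d] . rhs[d:]."""
    n = len(w)
    sigma = [set() for _ in range(n + 1)]
    sigma[0].add((("start", start), 0))                  # INIT rule
    for cur in range(n + 1):
        work = list(sigma[cur])
        def add(j, entry):
            if entry not in sigma[j]:
                sigma[j].add(entry)
                if j == cur:
                    work.append(entry)
        while work:
            node, tag = work.pop()
            if node[0] == "start":                       # START rule
                for rhs in grammar[node[1]]:
                    add(cur, (("item", node[1], rhs, 0), tag))
            elif node[0] == "end":                       # END rule
                B = node[1]
                for cnode, i in list(sigma[tag]):        # linear scan of Sigma_k
                    if cnode[0] == "item":
                        _, A, rhs, d = cnode
                        if d < len(rhs) and rhs[d] == B:
                            add(cur, (("item", A, rhs, d + 1), i))
            else:
                _, A, rhs, d = node
                if d == len(rhs):                        # EXIT rule
                    add(cur, (("end", A), tag))
                elif rhs[d] in grammar:                  # CALL rule
                    add(cur, (("start", rhs[d]), cur))
                elif cur < n and rhs[d] == w[cur]:       # SCAN rule
                    add(cur + 1, (("item", A, rhs, d + 1), tag))
    return (("end", start), 0) in sigma[n]
```

For example, with grammar = {"S": [("a", "S"), ()]}, the call earley_recognize(grammar, "S", ["a", "a"]) returns True; acceptance is the presence of $\langle S\bullet, 0 \rangle$ in $\Sigma_{|w|}$.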
A.3 Look-ahead computation
Proof of correctness of Algorithm 1:
Proof. The system of equations can be solved using Jacobi iteration, with $FIRST_k(A) = \{\}$ as the initial approximation for each $A \in N$. If the sequence of approximate solutions for the system is $X_0, X_1, \ldots$, the set $X_i[A]$ ($i \ge 1$) contains the $k$-prefixes of terminal strings generated by balanced paths from $\bullet A$ to $A\bullet$ in which the number of call-return pairs is at most $i - 1$. Termination follows from the monotonicity of set union and $+_k$, and the finiteness of $M$.
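As an illustration, the following sketch runs this kind of fixed-point iteration for $FIRST_k$ over a production-level view of the grammar (our recasting, not Algorithm 1 verbatim; the grammar encoding is an assumption, and strings are modeled as tuples of terminals so that $+_k$ is truncated tuple concatenation).

```python
def first_k(grammar, k):
    """grammar: dict mapping each nonterminal A to a list of right-hand
    sides (tuples of symbols).  Returns the least fixed point mapping
    each A to the set of k-prefixes of terminal strings derivable from A."""
    first = {A: set() for A in grammar}          # initial approximation {}
    def cat_k(L, R):                             # the +k operation
        return {(l + r)[:k] for l in L for r in R}
    changed = True
    while changed:                               # monotone over a finite
        changed = False                          # lattice, hence terminates
        for A, rhss in grammar.items():
            new = set(first[A])
            for rhs in rhss:
                acc = {()}                       # k-prefixes of the empty string
                for u in rhs:
                    acc = cat_k(acc, first[u] if u in grammar else {(u,)})
                new |= acc
            if new != first[A]:
                first[A], changed = new, True
    return first
```

For instance, first_k({"S": [("a", "S"), ()]}, 2) yields {(), ("a",), ("a", "a")} for S.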
Proof of correctness of Algorithm 2:
Proof. The system of equations can be solved using Jacobi iteration. If the sequence of approximate solutions is $X_0, X_1, \ldots$, then $X_i[B]$ ($i \ge 1$) contains the $k$-prefixes of terminal strings generated by CR-paths from $B\bullet$ to $S'\bullet$ in which there are $i$ or fewer unmatched return nodes.
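A companion sketch for Algorithm 2's fixed point, again recast over productions rather than GFG CR-paths (our assumption; the end of input is modeled as the empty suffix, and first is the map computed by the first_k sketch above):

```python
def follow_k(grammar, start, k, first):
    """Jacobi-style fixed point for FOLLOW_k.  'first' maps each
    nonterminal to its FIRST_k set (tuples of terminals), e.g. as
    computed by the first_k sketch above."""
    follow = {A: set() for A in grammar}
    follow[start] = {()}                     # nothing follows the start symbol
    def cat_k(L, R):                         # truncated concatenation +k
        return {(l + r)[:k] for l in L for r in R}
    def first_of(seq):                       # FIRST_k of a sentential suffix
        acc = {()}
        for u in seq:
            acc = cat_k(acc, first[u] if u in grammar else {(u,)})
        return acc
    changed = True
    while changed:                           # monotone and finite, so it halts
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for j, B in enumerate(rhs):
                    if B not in grammar:     # only nonterminals have FOLLOW
                        continue
                    new = follow[B] | cat_k(first_of(rhs[j + 1:]), follow[A])
                    if new != follow[B]:
                        follow[B], changed = new, True
    return follow
```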