All content in this area was uploaded by Keshav Pingali on Jul 01, 2017

A Graphical Model for Context-free Grammar Parsing

Keshav Pingali¹ and Gianfranco Bilardi²

¹ The University of Texas, Austin, Texas 78712, USA.
Email: pingali@cs.utexas.edu

² Università di Padova, 35131 Padova, Italy.
Email: bilardi@dei.unipd.it

Abstract. In the compiler literature, parsing algorithms for context-free gram-

mars are presented using string rewriting systems or abstract machines such as

pushdown automata. Unfortunately, the resulting descriptions can be baroque,

and even a basic understanding of some parsing algorithms, such as Earley’s

algorithm for general context-free grammars, can be elusive. In this paper, we

present a graphical representation of context-free grammars called the Grammar

Flow Graph (GFG) that permits parsing problems to be phrased as path problems

in graphs; intuitively, the GFG plays the same role for context-free grammars

that nondeterministic ﬁnite-state automata play for regular grammars. We show

that the GFG permits an elementary treatment of Earley’s algorithm that is much

easier to understand than previous descriptions of this algorithm. In addition,

look-ahead computation can be expressed as a simple inter-procedural dataﬂow

analysis problem, providing an unexpected link between front-end and back-end

technologies in compilers. These results suggest that the GFG can be a new foun-

dation for the study of context-free grammars.

1 Introduction

The development of elegant and practical parsing algorithms for context-free grammars

is one of the major accomplishments of 20th century Computer Science. Two abstrac-

tions are used to present these algorithms: string rewriting systems and pushdown au-

tomata, but the resulting descriptions are unsatisfactory for several reasons.

– Even an elementary understanding of some grammar classes requires mastering a formidable number of complex concepts. For example, LR(k) parsing requires an understanding of rightmost derivations, right sentential forms, viable prefixes, handles, complete valid items, and conflicts, among other notions.

– Parsing algorithms for different grammar classes are presented using different abstractions; for example, LL grammars are presented using recursive-descent, while LR grammars are presented using shift-reduce parsers. This obscures connections between different grammar classes and parsing techniques.

– Although regular grammars are a proper subset of context-free grammars, parsing algorithms for regular grammars, which are presented using finite-state automata, appear to be entirely unrelated to parsing algorithms for context-free grammars.

In this paper, we present a novel approach to context-free grammar parsing that is

based on a graphical representation of context-free grammars called the Grammar Flow

Graph (GFG). Intuitively, the GFG plays the same role for context-free grammars that

the nondeterministic ﬁnite-state automaton (NFA) does for regular grammars: parsing

problems can be formulated as path problems in the GFG, and parsing algorithms be-

come algorithms for solving these path problems. The GFG simpliﬁes and uniﬁes the

presentation of parsing algorithms for different grammar classes; in addition, ﬁnite-

state automata can be seen as an optimization of the GFG for the special case of regular

grammars, providing a pleasing connection between regular and context-free grammars.

Section 2 introduces the GFG, and shows how the GFG for a given context-free

grammar can be constructed in a straightforward way. Membership of a string in the

language generated by the grammar can be proved by ﬁnding what we call a complete

balanced GFG path that generates this string. Since every regular grammar is also a

context-free grammar, a regular grammar has both a GFG and an NFA representation.

In Section 2.4, we establish a connection between these representations: we show that

applying the continuation-passing style (CPS) optimization [13,18] to the GFG of a

right-linear regular grammar produces an NFA that is similar to the NFA produced by

the standard algorithm for converting a right-linear regular grammar to an NFA.

Earley’s algorithm [6] for parsing general context-free grammars is one of the more complicated parsing algorithms in the literature [1]. The GFG reveals that this algorithm is a straightforward extension of the well-known “ε-closure” algorithm for simulating all the moves of an NFA (Section 3). The resulting description is much simpler than previous descriptions of this algorithm, which are based on dynamic programming, abstract interpretation, and Galois connections [6,8,5].

Look-ahead is usually presented in the context of particular parsing strategies such

as SLL(1) parsing. In Section 4, we show that the GFG permits look-ahead computa-

tion to be formulated independently of the parsing strategy as a simple inter-procedural

dataﬂow analysis problem, unifying algorithmic techniques for compiler front-ends and

back-ends. The GFG also enables a simple description of parsers for LL and LR gram-

mars and their sub-classes such as SLL, SLR and LALR grammars, although we do not

discuss this in this paper.

Section 5 describes related work. Structurally, the GFG resembles the recursive

transition network (RTN) [20], which is used in natural language processing and parsers

like ANTLR [10], but there are crucial differences. In particular, the GFG is a single

graph in which certain paths are of interest, not a collection of recursive state machines

with an operational model like chart parsing for their interpretation. Although motivated

by similar concerns, complete balanced paths are different from CFL-paths [21].

Proofs of the main theorems are given in the appendix.

2 Grammar Flow Graph (GFG) and complete balanced paths

A context-free grammar Γ is a 4-tuple <N, T, P, S> where N is a finite set of non-terminals, T is a finite set of terminals, P ⊆ N × (N ∪ T)* is the set of productions, and S ∈ N is the start symbol. To simplify the development, we make the following standard assumptions about Γ throughout this paper.

– A1: S does not appear on the righthand side of any production.

– A2: Every non-terminal is used in a derivation of some string of terminals from S (no useless non-terminals [1]).

Any grammar Γ′ can be transformed in time O(|Γ′|) into an equivalent grammar Γ satisfying the above assumptions [17]. The running example in this paper is this grammar: E → int | (E+E) | E+E. An equivalent grammar is shown in Figure 1, where the production S → E has been added to comply with A1.

2.1 Grammar ﬂow graph (GFG)

Figure 1 shows the GFG for the expression grammar. Some edges are labeled explicitly with terminal symbols, and the others are implicitly labeled with ε. The GFG can be understood by analogy with inter-procedural control-flow graphs: each production is represented by a “procedure” whose control-flow graph represents the righthand side of that production, and a non-terminal A is represented by a pair of nodes •A and A•, called the start and end nodes for A, that gather together the control-flow graphs for the productions of that non-terminal. An occurrence of a non-terminal in the righthand side of a production is treated as an invocation of that non-terminal.

The control-flow graph for a production A→u_1u_2..u_r has r+1 nodes. As in finite-state automata, node labels in a GFG do not play a role in parsing and can be chosen arbitrarily, but it is convenient to label these nodes A→•u_1u_2..u_r through A→u_1u_2..u_r•; intuitively, the • indicates how far parsing has progressed through a production (these labels are related to items [1]). The first and last nodes in this sequence are called the entry and exit nodes for that production. If u_i is a terminal, there is a scan edge with that label from the scan node A→u_1..u_{i−1}•u_i..u_r to node A→u_1..u_i•u_{i+1}..u_r, just as in finite-state automata. If u_i is a non-terminal, it is considered to be an “invocation” of that non-terminal, so there are call and return edges that connect node A→u_1..u_{i−1}•u_i..u_r to the start node of non-terminal u_i and its end node to A→u_1..u_i•u_{i+1}..u_r.

Formally, the GFG for a grammar Γ is denoted by GFG(Γ) and it is defined as shown in Definition 1. It is easy to construct the GFG for a grammar Γ in O(|Γ|) time and space using Definition 1.

Definition 1. If Γ = <N, T, P, S> is a context-free grammar, G = GFG(Γ) is the smallest directed graph (V(Γ), E(Γ)) that satisfies the following properties.

Fig. 1. Grammar Flow Graph example

For each non-terminal A ∈ N, V(Γ) contains nodes labeled •A and A•, called the start and end nodes respectively for A.

For each production A→ε, V(Γ) contains a node labeled A→•, and E(Γ) contains edges (•A, A→•) and (A→•, A•).

For each production A→u_1u_2...u_r:

  – V(Γ) contains (r+1) nodes labeled A→•u_1...u_r, A→u_1•...u_r, ..., A→u_1...u_r•,

  – E(Γ) contains entry edge (•A, A→•u_1...u_r) and exit edge (A→u_1...u_r•, A•),

  – for each u_i ∈ T, E(Γ) contains a scan edge (A→u_1...u_{i−1}•u_i...u_r, A→u_1...u_i•u_{i+1}...u_r) labeled u_i,

  – for each u_i ∈ N, E(Γ) contains a call edge (A→u_1...u_{i−1}•u_i...u_r, •u_i) and a return edge (u_i•, A→u_1...u_i•u_{i+1}...u_r). Node A→u_1...u_{i−1}•u_i...u_r is a call node, and matches the return node A→u_1...u_i•u_{i+1}...u_r.

Edges other than scan edges are labeled with ε.

When the grammar is obvious from the context, a GFG will be denoted by G=(V, E).

Note that start and end nodes are the only nodes that can have a fan-out greater than

one. This fact will be important when we interpret the GFG as a nondeterministic au-

tomaton in Section 2.3.
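Definition 1 translates almost line for line into code. The following is a minimal sketch, not the authors' implementation; the tuple encoding of nodes and the names `build_gfg`, `GRAMMAR` and `NONTERMINALS` are assumptions made for illustration. A production is a pair (left-hand side, list of right-hand-side symbols), and an ε edge is represented by the label `None`.

```python
from collections import Counter


def build_gfg(productions, nonterminals):
    """Build (nodes, edges) following Definition 1.

    Nodes are tuples: ("start", A), ("end", A) or ("item", A, rhs, dot),
    where dot is the position of the bullet in the right-hand side.
    Edges are (source, target, label); label None stands for an epsilon edge.
    """
    nodes, edges = set(), set()
    for a in nonterminals:
        nodes.update({("start", a), ("end", a)})
    for a, rhs in productions:
        rhs = tuple(rhs)
        # r+1 nodes for a production with r symbols; a single node if rhs is empty
        items = [("item", a, rhs, dot) for dot in range(len(rhs) + 1)]
        nodes.update(items)
        edges.add((("start", a), items[0], None))            # entry edge
        edges.add((items[-1], ("end", a), None))             # exit edge
        for i, u in enumerate(rhs):
            if u in nonterminals:
                edges.add((items[i], ("start", u), None))    # call edge
                edges.add((("end", u), items[i + 1], None))  # return edge
            else:
                edges.add((items[i], items[i + 1], u))       # scan edge
    return nodes, edges


# Running example: S -> E, E -> int | (E+E) | E+E
GRAMMAR = [
    ("S", ["E"]),
    ("E", ["int"]),
    ("E", ["(", "E", "+", "E", ")"]),
    ("E", ["E", "+", "E"]),
]
NONTERMINALS = {"S", "E"}

nodes, edges = build_gfg(GRAMMAR, NONTERMINALS)
fan_out = Counter(src for src, _, _ in edges)
branching = {node for node, count in fan_out.items() if count > 1}
print(len(nodes), len(edges), sorted(branching))
```

For the running example, `branching` contains only start and end nodes, which is the fan-out property noted above. Each production contributes a constant number of nodes and edges per symbol, so the construction is linear in the size of the grammar.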

Node type   Description
---------   --------------------
start       Node labeled •A
end         Node labeled A•
call        Node labeled A→α•Bγ
return      Node labeled A→αB•γ
entry       Node labeled A→•α
exit        Node labeled A→α•
scan        Node labeled A→α•tγ

Table 1. Classification of GFG nodes: a node can belong to several categories. (A, B ∈ N, t ∈ T, and α, γ ∈ (T+N)*)

2.2 Balanced paths

The following deﬁnition is standard.

Definition 2. A path in a GFG G = (V, E) is a non-empty sequence of nodes v_0, ..., v_l, such that (v_0, v_1), (v_1, v_2), ..., (v_{l−1}, v_l) are all edges in E.

In a given GFG, the notation v_1 ⇝ v_2 denotes the edge from v_1 to v_2, and the notation v_1 ⇝* v_n denotes a path from v_1 to v_n; the symbol “→” is reserved for productions and derivations. If Q_1: v_1 ⇝* v_m and Q_2: v_m ⇝* v_r are paths in a GFG, the notation Q_1+Q_2 denotes the concatenation of paths Q_1 and Q_2. In this spirit, we denote string concatenation by + as well. It is convenient to define the following terms to talk about certain paths of interest in the GFG.

Definition 3. A complete path in a GFG is a path whose first node is •S and whose last node is S•.

A path is said to generate the word w resulting from the concatenation of the labels on its sequence of edges. By convention, w = ε for a path with a single node.

The GFG can be viewed as a nondeterministic finite-state automaton (NFA) whose start state is •S, whose accepting state is S•, and which makes nondeterministic choices at start and end nodes that have a fan-out more than one. Each complete GFG path generates a word in the regular language recognized by this NFA. In Figure 1, the path Q: •S ⇝ S→•E ⇝ •E ⇝ E→•(E+E) ⇝ E→(•E+E) ⇝ •E ⇝ E→•int ⇝ E→int• ⇝ E• ⇝ S→E• ⇝ S• generates the string “( int”. However, this string is not generated by the context-free grammar from which this GFG was constructed.

To interpret the GFG as the representation of a context-free grammar, it is necessary to restrict the paths that can be followed by the automaton. Going back to the intuition that the GFG is similar to an inter-procedural call graph, we see that Q is not an inter-procedurally valid path [14]: at E•, it is necessary to take the return edge to node E→(E•+E) since the call of E that is being completed was made at node E→(•E+E). In general, the automaton can make a free choice at start nodes just like an NFA, but at end nodes, the return edge to be taken is determined by the call that is being completed.

The paths the automaton is allowed to follow are called complete balanced paths

in this paper. Intuitively, if we consider matching call and return nodes to be opening

and closing parentheses respectively of a unique color, the parentheses on a complete balanced path must be properly nested [4]. In the formal definition below, if K is a sequence of nodes, we let v, K, w represent the sequence of nodes obtained by prepending node v and appending node w to K.

Definition 4. Given a GFG for a grammar Γ = <N, T, P, S>, the set of balanced sequences of call and return nodes is the smallest set of sequences of call and return nodes that is closed under the following conditions.

– The empty sequence is balanced.

– The sequence (A→α•Bγ), K, (A→αB•γ) is balanced if K is a balanced sequence, and production (A→αBγ) ∈ P.

– The concatenation of two balanced sequences v_1...v_f and y_1...y_s is balanced if v_f ≠ y_1. If v_f = y_1, the sequence v_1...v_f y_2...y_s is balanced.

This definition is essentially the same as the standard definition of balanced sequences of parentheses; the only difference is the case of v_f = y_1 in the last clause, which arises because a node of the form A→αX•Yβ is both a return node and a call node.

Definition 5. A GFG path v_0 ⇝* v_l is said to be a balanced path if its subsequence of call and return nodes is balanced.
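For complete paths, the balance condition can be checked with an ordinary stack of pending call nodes. The sketch below is illustrative only: the function name and the tuple encoding of nodes, ("start", A), ("end", A) and ("item", A, rhs, dot), are assumptions, and the check is restricted to paths that begin and end at start or end nodes, so that a node that is both a return node and a call node always plays both roles, in that order.

```python
def is_balanced_complete_path(path, nonterminals):
    """Check that the call and return nodes on a path are properly nested.

    Assumes the path begins and ends at start/end nodes (as a complete
    path does), so every call node on it is entered before its return.
    """
    stack = []                                   # pending (unmatched) call nodes
    for node in path:
        if node[0] != "item":
            continue                             # start/end nodes are ignored
        _, lhs, rhs, dot = node
        if dot > 0 and rhs[dot - 1] in nonterminals:       # return node
            # it must match the most recent unmatched call node
            if not stack or stack.pop() != (lhs, rhs, dot - 1):
                return False
        if dot < len(rhs) and rhs[dot] in nonterminals:    # call node
            stack.append((lhs, rhs, dot))
    return not stack


NONTERMINALS = {"S", "E"}
UNIT, INT, PAREN = ("E",), ("int",), ("(", "E", "+", "E", ")")

# The path Q from the text, which generates "( int"
Q = [("start", "S"), ("item", "S", UNIT, 0), ("start", "E"),
     ("item", "E", PAREN, 0), ("item", "E", PAREN, 1), ("start", "E"),
     ("item", "E", INT, 0), ("item", "E", INT, 1), ("end", "E"),
     ("item", "S", UNIT, 1), ("end", "S")]

# A complete path that generates "int"
GOOD = [("start", "S"), ("item", "S", UNIT, 0), ("start", "E"),
        ("item", "E", INT, 0), ("item", "E", INT, 1), ("end", "E"),
        ("item", "S", UNIT, 1), ("end", "S")]

print(is_balanced_complete_path(Q, NONTERMINALS),
      is_balanced_complete_path(GOOD, NONTERMINALS))
```

On Q the check fails at S→E•, because the most recent unmatched call node is E→(•E+E) rather than S→•E.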

Theorem 1. If Γ = <N, T, P, S> is a context-free grammar and w ∈ T*, w is in the language generated by Γ iff it is generated by a complete balanced path in GFG(Γ).

Proof. This is a special case of Theorem 4 in the Appendix.

Therefore, the parsing problem for a context-free grammar Γ can be framed in GFG terms as follows: given a string w, determine if there are complete balanced paths in GFG(Γ) that generate w (recognition), and if so, produce a representation of these paths (parsing). If the grammar is unambiguous, each string in the language is generated by exactly one such path.

The parsing techniques considered in this paper read the input string w from left to right one symbol at a time, and determine reachability along certain paths starting at •S. These paths are always prefixes of complete balanced paths, and if a prefix u of w has been read up to that point, all these paths generate u. For the technical development, similar paths are needed even though they begin at nodes other than •S. Intuitively, these call-return paths (CR-paths for short) are just segments of complete balanced paths; they may contain unmatched call and return nodes, but they do not have mismatched call and return nodes, so they can always be extended to complete balanced paths.

Definition 6. Given a GFG, a CR-sequence is a sequence of call and return nodes that does not contain a subsequence v_c, K, v_r where v_c ∈ call, K is balanced, v_r ∈ return, and v_c and v_r are not matched.

Deﬁnition 7. A GFG path is said to be a CR-path if its subsequence of call and return

nodes is a CR-sequence.

Unless otherwise speciﬁed, the origin of a CR-path will be assumed to be •S, the

case that arises most frequently.

2.3 Nondeterministic GFG automaton (NGA)

NGA configuration (PC × C × K), where:
  Program counter PC ∈ V(Γ) (a state of the finite control)
  Partially-read input strings C ∈ T* × T*
    (C = (u, v), where prefix u of input string w = uv has been read)
  Stack of return nodes K ∈ V_R(Γ)*, where V_R(Γ) is the set of return nodes

Initial configuration: <•S, [ ], •w>
Accepting configuration: <S•, [ ], w•>

Transition function:
  CALL   <A→α•Bγ, C, K>         ⟼  <•B, C, (A→αB•γ, K)>
  START  <•B, C, K>             ⟼  <B→•β, C, K>   (nondeterministic choice)
  EXIT   <B→β•, C, K>           ⟼  <B•, C, K>
  END    <B•, C, (A→αB•γ, K)>   ⟼  <A→αB•γ, C, K>
  SCAN   <A→α•tγ, u•tv, K>      ⟼  <A→αt•γ, ut•v, K>

Fig. 2. Nondeterministic GFG Automaton (NGA)

Figure 2 specifies a pushdown automaton (PDA), called the nondeterministic GFG automaton (NGA), that traverses complete balanced paths in a GFG under the control of the input string. To match calls with returns, it uses a stack of “return addresses” as is done in implementations of procedural languages. The configuration of the automaton is a three-tuple consisting of the GFG node where the automaton currently is (this is called the PC), a stack of return nodes, and the partially read input string. The symbol ⟼ denotes a state transition.

The NGA begins at •S with the empty stack. At a call node, it pushes the matching return node on the stack. At a start node, it chooses the production nondeterministically. At an end node, it pops a return node from the stack and continues the traversal from there. If the input string is in the language generated by the grammar, the automaton will reach S• with the empty stack (the end rule cannot fire at S• because the stack is empty). It is not difficult to prove that the NGA accepts exactly those strings that can be generated by some complete balanced path in GFG(Γ) whence, by Theorem 1, the NGA accepts the language of Γ. (Technically, acceptance is by final state [9], but it is easily shown that the final state S• can only be reached with an empty stack.)
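The transition rules of Figure 2 can be explored exhaustively to get a feel for the NGA. The sketch below is illustrative only and makes assumptions that are not part of the NGA: the function name, the tuple encoding of nodes, and in particular the cut-off `max_stack`. Left-recursive productions such as E→E+E let the stack grow without reading input, so an unbounded search would not terminate; with the cut-off, the search may miss accepting runs that need a deeper stack.

```python
def nga_accepts(productions, nonterminals, start, word, max_stack=None):
    """Search the configurations <PC, input position, stack> of the NGA.

    max_stack is a heuristic bound on stack depth (an assumption of this
    sketch); strings whose accepting runs need a deeper stack are rejected.
    """
    by_lhs = {}
    for lhs, rhs in productions:
        by_lhs.setdefault(lhs, []).append(tuple(rhs))
    if max_stack is None:
        max_stack = 2 * len(word) + 2
    initial = (("start", start), 0, ())
    seen, work = {initial}, [initial]
    while work:
        pc, pos, stack = work.pop()
        if pc == ("end", start) and pos == len(word) and not stack:
            return True                                     # accepting configuration
        successors = []
        if pc[0] == "start":                                # START: choose a production
            successors = [(("item", pc[1], rhs, 0), pos, stack)
                          for rhs in by_lhs[pc[1]]]
        elif pc[0] == "end":                                # END: pop the return node
            if stack:
                successors = [(stack[-1], pos, stack[:-1])]
        else:
            _, lhs, rhs, dot = pc
            if dot == len(rhs):                             # EXIT
                successors = [(("end", lhs), pos, stack)]
            elif rhs[dot] in nonterminals:                  # CALL: push matching return node
                if len(stack) < max_stack:
                    ret = ("item", lhs, rhs, dot + 1)
                    successors = [(("start", rhs[dot]), pos, stack + (ret,))]
            elif pos < len(word) and rhs[dot] == word[pos]:  # SCAN
                successors = [(("item", lhs, rhs, dot + 1), pos + 1, stack)]
        for cfg in successors:
            if cfg not in seen:
                seen.add(cfg)
                work.append(cfg)
    return False


GRAMMAR = [
    ("S", ["E"]),
    ("E", ["int"]),
    ("E", ["(", "E", "+", "E", ")"]),
    ("E", ["E", "+", "E"]),
]
NONTERMINALS = {"S", "E"}

print(nga_accepts(GRAMMAR, NONTERMINALS, "S", ["int", "+", "int"]))
print(nga_accepts(GRAMMAR, NONTERMINALS, "S", ["(", "int"]))
```

The exhaustive search stands in for the angelic choice at start nodes; it is meant as an executable reading of the figure rather than a practical recognizer.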

The nondeterminism in the NGA is called globally angelic nondeterminism [3] be-

cause the nondeterministic transitions at start nodes have to ensure that the NGA ul-

timately reaches S•if the string is in the language generated by the grammar. The

recognition algorithms described in this paper are concerned with deterministic imple-

mentations of the globally angelic nondeterminism in the NGA.

2.4 Relationship between NFA and GFG for regular grammars

Every regular grammar is also a context-free grammar, so a regular grammar has two graphical representations, an NFA and a GFG. A natural question is whether there is a connection between these graphs. We show that applying the continuation-passing style (CPS) optimization [13,18] to the NGA of a context-free grammar that is a right-linear regular grammar³ produces an NFA for that grammar.

For any context-free grammar, consider a production A→αB in which the last symbol on the righthand side is a non-terminal. The canonical NGA in Figure 2 will push the return node A→αB• before invoking B, but after returning to this exit node, the NGA just transitions to A• and pops the return node for the invocation of A. Had a return address not been pushed when the call to B was made, the NGA would still recognize the input string correctly because when the invocation of B completes, the NGA would pop the return stack and transition directly to the return node for the invocation of A. This optimization is similar to the continuation-passing style (CPS) transformation, which is used in programming language implementations to convert tail-recursion to iteration.

To implement the CPS optimization in the context of the GFG, it is useful to introduce a new type of node called a no-op node, which represents a node at which the NGA does nothing other than to transition to the successor of that node. If a production for a non-terminal other than S ends with a non-terminal, the corresponding call is replaced with a no-op node; since the NGA will never come back to the corresponding return node, this node can be replaced with a no-op node as well. For a right-linear regular grammar, there are no call or return nodes in the optimized GFG. The resulting GFG is just an NFA, and it is a variation of the NFA that is produced by using the standard algorithms for producing an NFA from a right-linear regular grammar [9].

3 Parsing of general context-free grammars

General context-free grammars can be parsed using an algorithm due to Earley [6].

Described using derivations, the algorithm is not very intuitive and seems unrelated

to other parsing algorithms. For example, the monograph on parsing by Sippu and

Soisalon-Soininen [17] omits it, Grune and Jacobs’ book describes it as “top-down

restricted breadth-ﬁrst bottom-up parsing” [8], and the “Dragon book” [1] mentions it

only in the bibliography as “a complex, general-purpose algorithm due to Earley that

tabulates LR-items for each substring of the input.” Cousot and Cousot use Galois con-

nections between lattices to show that Earley’s algorithm is an abstract interpretation of

a reﬁnement of the derivation semantics of context-free grammars [5].

In contrast to these complicated narratives, a remarkably simple interpretation of

Earley’s algorithm emerges when it is viewed in terms of the GFG: Earley’s algorithm

³ A right-linear regular grammar is a regular grammar in which the righthand side of a production consists of a string of zero or more terminals followed by at most one non-terminal.

[Figure 3: (a) GFG for expression grammar, with simple node labels; (b) NFA reachability; (c) Earley parser. Panels (b) and (c) show the sets Σ0, ..., Σ5 for the input string 7+8+9.]

Fig. 3. Earley parser: example.

is the context-free grammar analog of the well-known simulation algorithm for non-

deterministic ﬁnite-state automata (NFA) [1]. While the latter tracks reachability along

preﬁxes of complete paths, the former tracks reachability along preﬁxes of complete

balanced paths.

3.1 NFA simulation algorithm

As a step towards Earley’s algorithm, consider interpreting the GFG as an NFA (so nondeterministic choices are permitted at both start and end nodes). The NFA simulation on a given input word w[1..n] can be viewed as the construction of a sequence of node sets Σ_0, ..., Σ_n. Here, Σ_0 is the ε-closure of {•S}. For i = 1, ..., n, set Σ_i is the

(a) NFA Simulation of GFG

  Sets of GFG nodes Σ : P(V(Γ))
  Partially-read input strings C : T* × T*
  Recognizer configurations (Σ × C)+
  Acceptance: S• ∈ Σ_{|w|}

  INIT   (•S ∈ Σ_0) ∧ (C_0 = •w)
  CALL   A→α•Bγ ∈ Σ_j                 ⟹  •B ∈ Σ_j
  START  •B ∈ Σ_j                     ⟹  B→•β ∈ Σ_j
  EXIT   B→β• ∈ Σ_j                   ⟹  B• ∈ Σ_j
  END    B• ∈ Σ_j                     ⟹  A→αB•γ ∈ Σ_j
  SCAN   A→α•tγ ∈ Σ_j,  C_j = u•tv    ⟹  (A→αt•γ ∈ Σ_{j+1}) ∧ (C_{j+1} = ut•v)

(b) Earley recognizer

  Non-negative integers: ℕ
  Sets of tagged GFG nodes Σ : P(V(Γ) × ℕ)
  Partially-read input strings C : T* × T*
  Recognizer configurations (Σ × C)+
  Acceptance: <S•, 0> ∈ Σ_{|w|}

  INIT   (<•S, 0> ∈ Σ_0) ∧ (C_0 = •w)
  CALL   <A→α•Bγ, i> ∈ Σ_j                       ⟹  <•B, j> ∈ Σ_j
  START  <•B, j> ∈ Σ_j                           ⟹  <B→•β, j> ∈ Σ_j
  EXIT   <B→β•, k> ∈ Σ_j                         ⟹  <B•, k> ∈ Σ_j
  END    <B•, k> ∈ Σ_j,  <A→α•Bγ, i> ∈ Σ_k       ⟹  <A→αB•γ, i> ∈ Σ_j
  SCAN   <A→α•tγ, i> ∈ Σ_j,  C_j = u•tv          ⟹  (<A→αt•γ, i> ∈ Σ_{j+1}) ∧ (C_{j+1} = ut•v)

(c) Earley parser

  Non-negative integers: ℕ
  Program counter PC : V(Γ) × ℕ
  Stack of call nodes K : V_R(Γ)*
  Parser configurations: (PC × ℕ × K)
  Acceptance: final configuration is <<•S, 0>, 0, [ ]>

  INIT     <<S•, 0>, |w|, [ ]>
  CALL⁻¹   <<•B, j>, j, (<A→α•Bγ, i>, K)>   ⟼  <<A→α•Bγ, i>, j, K>
  START⁻¹  <<B→•β, j>, j, K>                ⟼  <<•B, j>, j, K>
  EXIT⁻¹   <<B•, k>, j, K>                  ⟼  <<B→β•, k>, j, K>
             if (<B→β•, k> ∈ Σ_j)   (non-determinism)
  END⁻¹    <<A→αB•γ, i>, j, K>              ⟼  <<B•, k>, j, (<A→α•Bγ, i>, K)>
             if (<B•, k> ∈ Σ_j and <A→α•Bγ, i> ∈ Σ_k)   (non-determinism)
  SCAN⁻¹   <<A→αt•γ, i>, (j+1), K>          ⟼  <<A→α•tγ, i>, j, K>

Fig. 4. NFA, Earley recognizer, and Earley parser: input word is w

ε-closure of the set of nodes reachable from nodes in Σ_{i−1} by scan edges labeled w[i]. The string w is in the language recognized by the NFA if and only if S• ∈ Σ_n.

Figure 3(a) shows the GFG of Figure 1, but with simple node labels. Figure 3(b) illustrates the behavior of the NFA simulation algorithm for the input string “7+8+9”. Each Σ_i is associated with a terminal string pair C_i = u.v, which indicates that prefix u of the input string w = uv has been read up to that point.

The behavior of this NFA ε-closure algorithm on a GFG is described concisely by the rules shown in Figure 4(a). Each rule is an inference rule or constraint; in some rules, the premises have multiple consequents. It is straightforward to use these rules to compute the smallest Σ-sets that satisfy all the constraints. The INIT rule enters •S into Σ_0. Each of the other rules is associated with traversing a GFG edge from the node in its assumption to the node in its consequence. Thus, the CALL, START, END, and EXIT rules compute the ε-closure of a Σ-set; notice that the END rule is applied to all outgoing edges from end nodes.
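The rules of Figure 4(a) amount to a worklist ε-closure followed by a scan step per input symbol. The following sketch is one possible reading; the function name and the tuple encoding of nodes, ("start", A), ("end", A) and ("item", A, rhs, dot), are assumptions for illustration, and terminals are assumed to be distinct from non-terminal names.

```python
def nfa_accepts(productions, nonterminals, start, word):
    """Simulate the GFG as a plain NFA: END may follow any return edge."""
    by_lhs, returns = {}, {}
    for lhs, rhs in productions:
        rhs = tuple(rhs)
        by_lhs.setdefault(lhs, []).append(rhs)
        for i, u in enumerate(rhs):
            if u in nonterminals:        # return node after this occurrence of u
                returns.setdefault(u, []).append(("item", lhs, rhs, i + 1))

    def eps_successors(node):
        if node[0] == "start":                       # START
            return [("item", node[1], rhs, 0) for rhs in by_lhs[node[1]]]
        if node[0] == "end":                         # END: all return edges
            return returns.get(node[1], [])
        _, lhs, rhs, dot = node
        if dot == len(rhs):                          # EXIT
            return [("end", lhs)]
        if rhs[dot] in nonterminals:                 # CALL
            return [("start", rhs[dot])]
        return []                                    # scan node: no epsilon edge

    def closure(seed):
        seen, work = set(seed), list(seed)
        while work:
            for succ in eps_successors(work.pop()):
                if succ not in seen:
                    seen.add(succ)
                    work.append(succ)
        return seen

    sigma = closure({("start", start)})              # Sigma_0
    for t in word:                                   # SCAN, then close Sigma_i
        scanned = {("item", n[1], n[2], n[3] + 1) for n in sigma
                   if n[0] == "item" and n[3] < len(n[2])
                   and n[2][n[3]] == t and t not in nonterminals}
        sigma = closure(scanned)
    return ("end", start) in sigma


GRAMMAR = [
    ("S", ["E"]),
    ("E", ["int"]),
    ("E", ["(", "E", "+", "E", ")"]),
    ("E", ["E", "+", "E"]),
]
NONTERMINALS = {"S", "E"}

print(nfa_accepts(GRAMMAR, NONTERMINALS, "S", ["(", "int"]))
```

The string “( int” is accepted here even though the grammar does not generate it, because the END rule follows every return edge out of E•, including the one to S→E•.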

3.2 Earley’s algorithm

Like the NFA ε-closure algorithm, Earley’s algorithm builds Σ sets, but it computes reachability only along CR-paths starting at •S. Therefore, the main difference between the two algorithms is at end nodes: a CR-path that reaches an end node should be extended only to the return node corresponding to the last unmatched call node on that path.

One way to find this call node is to tag each start node with a unique ID (tag) when it is entered into a Σ-set, and propagate this tag through the nodes of productions for this non-terminal all the way to the end node. At the end node, this unique ID can be used to identify the Σ-set containing the corresponding start node. The last unmatched call node on the path must be contained in that set as well, and from that node, the return node to which the path should be extended can easily be determined.

To implement the tag, it is simple to use the number of the Σ-set to which the start node is added, as shown in Figure 4(b). When the CALL rule enters a start node into a Σ set, the tag assigned to this node is the number of that Σ set. The END rule is the only rule that actually uses tags; all other rules propagate tags. If <B•, k> ∈ Σ_j, then the matching start and call nodes are in Σ_k, so Σ_k is examined to determine which of the immediate predecessors of node •B occur in this set. These must be call nodes of the form A→α•Bγ, so the matching return nodes A→αB•γ are added to Σ_j with the tags of the corresponding call nodes. For a given grammar, this can easily be done in time constant with respect to the length of the input string. A string is in the language recognized by the GFG iff Σ_n contains <S•, 0>. Figure 3(c) shows the Σ sets computed by the Earley algorithm for the input string “7+8+9”.
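The tagged rules of Figure 4(b) can be sketched as follows. This is an illustrative reading rather than an optimized implementation: the function name and the tuple encoding of nodes are assumptions, and each Σ set is completed by a naive fixpoint loop (re-applying every rule until nothing changes), which is simple and also handles ε-productions but does not achieve the time bounds discussed in the text.

```python
def earley_recognize(productions, nonterminals, start, word):
    """Build the Sigma sets of tagged GFG nodes and test <S., 0> in Sigma_n."""
    by_lhs = {}
    for lhs, rhs in productions:
        by_lhs.setdefault(lhs, []).append(tuple(rhs))
    n = len(word)
    sigma = [set() for _ in range(n + 1)]
    sigma[0].add((("start", start), 0))                       # INIT

    def consequents(node, tag, j):
        """Yield (set index, tagged node) derived from <node, tag> in Sigma_j."""
        if node[0] == "start":                                # START
            for rhs in by_lhs[node[1]]:
                yield j, (("item", node[1], rhs, 0), tag)
        elif node[0] == "end":                                # END: consult Sigma_tag
            for caller, i in list(sigma[tag]):
                if (caller[0] == "item" and caller[3] < len(caller[2])
                        and caller[2][caller[3]] == node[1]):
                    yield j, (("item", caller[1], caller[2], caller[3] + 1), i)
        else:
            _, lhs, rhs, dot = node
            if dot == len(rhs):                               # EXIT
                yield j, (("end", lhs), tag)
            elif rhs[dot] in nonterminals:                    # CALL: new tag is j
                yield j, (("start", rhs[dot]), j)
            elif j < n and rhs[dot] == word[j]:               # SCAN
                yield j + 1, (("item", lhs, rhs, dot + 1), tag)

    for j in range(n + 1):
        changed = True
        while changed:                                        # naive fixpoint on Sigma_j
            changed = False
            for node, tag in list(sigma[j]):
                for k, entry in consequents(node, tag, j):
                    if entry not in sigma[k]:
                        sigma[k].add(entry)
                        changed = True
    return (("end", start), 0) in sigma[n]


GRAMMAR = [
    ("S", ["E"]),
    ("E", ["int"]),
    ("E", ["(", "E", "+", "E", ")"]),
    ("E", ["E", "+", "E"]),
]
NONTERMINALS = {"S", "E"}

print(earley_recognize(GRAMMAR, NONTERMINALS, "S", ["int", "+", "int", "+", "int"]))
print(earley_recognize(GRAMMAR, NONTERMINALS, "S", ["(", "int"]))
```

Compared with the NFA simulation, only the END case differs: it looks up call nodes in the Σ set named by the tag instead of following every return edge, so “( int” is now rejected.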

We discuss a small detail in using the rules of Figures 4(a,b) to construct Σ-sets for a given GFG and input word. The existence of a unique smallest sequence of Σ-sets can be proved in many ways, such as by observing that the rules have the diamond property and are strongly normalizing [19]. A canonical order of rule application for the NFA rules is the following. We give a unique number to each GFG edge, and associate the index ⟨j, m⟩ with a rule instance that corresponds to traversing edge m and adding the destination node to Σ_j; the scheduler always picks the rule instance with the smallest

index. This order completes the Σ sets in sequence, but many other orders are possible. The same order can be used for the rules in Figure 4(b) except that for the END rule, we use the number on the edge (B•, A→αB•γ).

Correctness of the rules of Figure 4(b) follows from Theorem 2.

Theorem 2. For a grammar Γ = <N, T, P, S> and an input word w, <S•, 0> ∈ Σ_{|w|} iff w is a word generated by grammar Γ.

Proof. See Section A.2.

The proof of Theorem 2 shows the following result, which is useful as a characterization of the contents of Σ sets. Let w[i..j] denote the substring of input w from position i to position j inclusive if i ≤ j, and let it denote ε if i > j. It is shown that <A→α•β, i> ∈ Σ_j iff there is a CR-path P: •S ⇝* •A ⇝* (A→α•β) such that

1. •S ⇝* •A generates w[1..i], and

2. •A ⇝* (A→α•β) is balanced and generates w[(i+1)..j].

Like the NFA algorithm, Earley’s algorithm determines reachability along certain paths but does not represent paths explicitly. Both algorithms permit such implicitly maintained paths to share “sub-paths”: in Figure 3(c), E• in Σ_1 is reached by two CR-paths, Q_1: (•S ⇝ p ⇝ •E ⇝ g ⇝ h ⇝ E•), and Q_2: (•S ⇝ p ⇝ •E ⇝ i ⇝ •E ⇝ g ⇝ h ⇝ E•), and they share the sub-path (•E ⇝ g ⇝ h ⇝ E•). This path sharing permits Earley’s algorithm to run in O(|w|^3) time for any grammar (improved to O(|w|^3/log|w|) by Graham et al. [7]), and O(|w|^2) time for any unambiguous grammar, as we show in Theorem 3.

Theorem 3. For a given GFG G = (V, E) and input word w, Earley’s algorithm requires O(|w|^2) space and O(|w|^3) time. If the grammar is unambiguous, the time complexity is reduced to O(|w|^2).

Proof. See Section A.2.

Earley parser. The rules in Figure 4(b) define a recognizer. To get a parser, we need a way to enumerate a representation of the parse tree, such as a complete, balanced GFG path, from the Σ sets; if the grammar is ambiguous, there may be multiple complete, balanced paths that generate the input word.

Figure 4(c) shows a state transition system that constructs such a path in reverse; if there are multiple paths that generate the string, one of these paths is reconstructed non-deterministically. The parser starts with the entry <S•, 0> in the last Σ set, and reconstructs in reverse the inference chain that produced it from the entry <•S, 0> in Σ_0; intuitively, it traverses the GFG in reverse from S• to •S, using the Σ set entries to guide the traversal. Like the NGA, it maintains a stack, but it pushes the matching call node when it traverses a return node, and pops the stack at a start node to determine how to continue the traversal.

The state of the parser is a three-tuple: a Σ set entry, the number of that Σ set, and the stack. The parser begins at <S•, 0> in Σ_n and an empty stack. It terminates when

it reaches <•S, 0> in Σ_0. The sequence of GFG nodes in the reverse path can be output during the execution of the transitions. It is easy to output other representations of parse trees if needed; for example, the parse tree can be produced in reverse post-order by outputting the terminal symbol or production name whenever a scan edge or exit node respectively is traversed in reverse by the parser.

To eliminate the need to look up Σ sets for the EXIT⁻¹ and END⁻¹ rules, the recognizer can save information relevant for the parser in a data structure associated with each Σ set. This data structure is a relation between the consequent and the premise(s) of each rule application; given a consequent, it returns the premise(s) that produced that consequent during recognition. If the grammar is ambiguous, there may be multiple premise(s) that produced a given consequent, and the data structure returns one of them non-deterministically. By enumerating these non-deterministic choices, it is possible to enumerate different parse trees for the given input string. Note that if the grammar is cyclic (that is, A →⁺ A for some non-terminal A), there may be an infinite number of parse trees for some strings.
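The reverse traversal can be sketched by letting the recognizer record, for each Σ set entry, the premises of the rule application that first produced it. The code below is a minimal illustration under assumptions of its own: the function name and the tuple encoding of nodes are hypothetical, only the first recorded premises are kept (so a single path is returned even for ambiguous inputs), and Σ sets are again completed by a naive fixpoint loop.

```python
def earley_parse(productions, nonterminals, start, word):
    """Return one complete balanced GFG path generating word, or None."""
    by_lhs = {}
    for lhs, rhs in productions:
        by_lhs.setdefault(lhs, []).append(tuple(rhs))
    n = len(word)
    # sigma[j] maps each entry to the premises recorded at its first insertion;
    # a premise is a pair (set index, entry)
    sigma = [dict() for _ in range(n + 1)]
    sigma[0][(("start", start), 0)] = None                    # INIT

    def consequents(node, tag, j):
        me = (j, (node, tag))
        if node[0] == "start":                                # START
            for rhs in by_lhs[node[1]]:
                yield j, (("item", node[1], rhs, 0), tag), (me,)
        elif node[0] == "end":                                # END: two premises
            for caller, i in list(sigma[tag]):
                if (caller[0] == "item" and caller[3] < len(caller[2])
                        and caller[2][caller[3]] == node[1]):
                    ret = ("item", caller[1], caller[2], caller[3] + 1)
                    yield j, (ret, i), (me, (tag, (caller, i)))
        else:
            _, lhs, rhs, dot = node
            if dot == len(rhs):                               # EXIT
                yield j, (("end", lhs), tag), (me,)
            elif rhs[dot] in nonterminals:                    # CALL
                yield j, (("start", rhs[dot]), j), (me,)
            elif j < n and rhs[dot] == word[j]:               # SCAN
                yield j + 1, (("item", lhs, rhs, dot + 1), tag), (me,)

    for j in range(n + 1):
        changed = True
        while changed:
            changed = False
            for node, tag in list(sigma[j]):
                for k, entry, premises in consequents(node, tag, j):
                    if entry not in sigma[k]:
                        sigma[k][entry] = premises
                        changed = True

    goal = (("end", start), 0)
    if goal not in sigma[n]:
        return None
    path, stack = [], []
    j, entry = n, goal
    while True:
        node, _ = entry
        path.append(node)
        if node[0] == "start":
            if not stack:
                break                        # reached the entry for the start symbol
            j, entry = stack.pop()           # CALL^-1: resume at the pending call node
        else:
            premises = sigma[j][entry]
            if len(premises) == 2:           # END^-1: remember the matching call node
                stack.append(premises[1])
            j, entry = premises[0]
    path.reverse()
    return path


GRAMMAR = [
    ("S", ["E"]),
    ("E", ["int"]),
    ("E", ["(", "E", "+", "E", ")"]),
    ("E", ["E", "+", "E"]),
]
NONTERMINALS = {"S", "E"}

for node in earley_parse(GRAMMAR, NONTERMINALS, "S", ["int", "+", "int"]):
    print(node)
```

Because each entry keeps only premises that were present before it was inserted, the backward walk cannot cycle; enumerating all recorded premises instead of the first would be needed to enumerate every parse.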

3.3 Discussion

In Earley’s paper, the call and start rules were combined into a single rule called

prediction, and the exit and end rules were combined into a single rule called comple-

tion [6]. Aycock and Horspool pre-compute some of the contents of Σ-sets to improve

the running time in practice [2].

Erasing tags from the rules in Figure 4(b) for the Earley recognizer produces the rules for the NFA ε-closure algorithm in Figure 4(a). The only nontrivial erasure is for the end rule: k, the tag of the tuple <B•, k>, becomes undefined when tags are deleted, so the antecedent <A→α•Bγ, i> ∈ Σ_k for this rule is erased. Erasure of tags demonstrates lucidly the close and previously unknown connection between the NFA ε-closure algorithm and Earley’s algorithm.

4 Preprocessing the GFG: look-ahead

Preprocessing the GFG is useful when many strings have to be parsed since the invest-

ment in preprocessing time and space is amortized over the parsing of multiple strings.

Look-ahead computation is a form of preprocessing that permits pruning of the set of

paths that need to be explored for a given input string.

Given a CR-path Q: •S ⇝* v which generates a string of terminals u, consider the set of all strings of k terminals that can be encountered along any CR extension of Q. When parsing a string uℓz with ℓ ∈ T^k, extensions of path Q can be safely ignored if ℓ does not belong to this set. We call this set the context-dependent look-ahead set at v for path Q, which we will write as CDL_k(Q) (in the literature on program optimization, Q is called the calling context for its last node v). LL(k) and LR(k) parsers use context-dependent look-ahead sets.

We note that for pruning paths, it is safe to use any superset of CDL_k(Q): larger supersets may be easier to compute off-line, possibly at the price of less pruning on-line. In this spirit, a widely used superset is FOLLOW_k(v), associated with GFG node v, which we call the context-independent look-ahead set. It is the union of the sets CDL_k(Q) over all CR-paths Q: •S ⇝* v. Context-independent look-ahead is used by SLL(k) and SLR(k) parsers. It has also been used to enhance Earley’s algorithm. Look-ahead sets intermediate between CDL_k(Q) and FOLLOW_k(v) have also been exploited, for example in LALR(k) and LALL(k) parsers [17].

The presentation of look-ahead computation algorithms is simplified if, at every stage of parsing, there is always a string ℓ of k symbols that has not yet been read. This can be accomplished by (i) padding the input string w with k $ symbols to form w$^k, where $ ∉ (T ∪ N), and (ii) replacing Γ = <N, T, P, S> with the augmented grammar Γ′ = <N′ = N ∪ {S′}, T′ = T ∪ {$}, P′ = P ∪ {S′ → S$^k}, S′>.

Figure 5(a) shows an example using a stylized GFG, with node labels omitted for brevity. The set FOLLOW_2(v) is shown in braces next to node v. If the word to be parsed is ybc, the parser can see that yb ∉ FOLLOW_2(v) for v = S→•yLab, so it can avoid exploration downstream of that node.

The influence of context is illustrated for node v = L→•a in Figure 5(a). Since the end node L• is reached before two terminal symbols are encountered, it is necessary to look beyond node L•, but the path relevant to look-ahead depends on the path that was taken to node •L. If the path taken was Q: •S′ ⇝* (S→y•Lab) ⇝ (•L) ⇝ (L→•a), then the relevant path for look-ahead is (L•) ⇝ (S→yL•ab) ⇝* S′•, so that CDL_2(Q) = {aa}. If the path taken was R: •S′ ⇝* (S→y•Lbc) ⇝ (•L) ⇝ (L→•a), then the relevant path for look-ahead is (L•) ⇝ (S→yL•bc) ⇝* S′•, and CDL_2(R) = {ab}.
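The two look-ahead sets in this example can be reproduced with a few lines of code. The sketch below assumes a helper `plus_k` that concatenates two sets of strings and keeps only the first k symbols of each result; the helper and all variable names are illustrative and not part of the paper.

```python
def plus_k(E, F, k):
    """Concatenate every string of E with every string of F, truncated to k symbols."""
    return {(e + f)[:k] for e in E for f in F}


K = 2
END = {"$" * K}  # look-ahead available beyond the end of the input, from the padding

# Terminals that follow the return from L in each of the two calling contexts.
after_return_in_Q = plus_k({"ab"}, END, K)  # context S→yL•ab
after_return_in_R = plus_k({"bc"}, END, K)  # context S→yL•bc

# At node L→•a the terminal a is still to be scanned before L• is reached.
cdl_Q = plus_k({"a"}, after_return_in_Q, K)
cdl_R = plus_k({"a"}, after_return_in_R, K)

# The context-independent set at the node is the union over all contexts.
follow_at_node = cdl_Q | cdl_R
```

Here `cdl_Q` is `{"aa"}` and `cdl_R` is `{"ab"}`, so a parser that knows its calling context can prune on a single string, whereas the union `{"aa", "ab"}` is all that a context-independent parser can use at this node.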

We deﬁne these concepts formally next.

Definition 8. Context-dependent look-ahead: If v is a node in the GFG of an augmented grammar Γ′ = <N′, T′, P′, S′>, the context-dependent look-ahead CDL_k(Q) for a CR-path Q: •S′ ⇝* v is the set of all k-prefixes of strings generated by paths Q_s: v ⇝* S′• where Q + Q_s is a complete CR-path.

Definition 9. Context-independent look-ahead: If v is a node in the GFG for an augmented grammar Γ′ = <N′, T′, P′, S′>, FOLLOW_k(v) is the set of all k-prefixes of strings generated by CR-paths v ⇝* S′•.

As is customary, we let FOLLOW_k(A) and FOLLOW(A) denote FOLLOW_k(A•) and FOLLOW_1(A•), respectively.

The rest of this section is devoted to the computation of look-ahead sets. It is convenient to introduce the function s_1 +_k s_2 of strings s_1 and s_2, which returns their concatenation truncated to k symbols. In Definition 10, this operation is lifted to sets of strings.

Definition 10. Let T* denote the set of strings of symbols from alphabet T.

– For E ⊆ T*, (E)_k is the set of k-prefixes of strings in E.

– For E, F ⊆ T*, E +_k F = (E + F)_k.

If E = {ε, t, tu, abc} and F = {ε, x, xy, xya}, then (E)_2 = {ε, t, tu, ab} and (F)_2 = {ε, x, xy}, and E +_2 F = (E + F)_2 = {ε, x, xy, t, tx, tu, ab}. Lemma 1(a) says that concatenation followed by truncation is equivalent to “pre-truncation” followed by concatenation and truncation; this permits look-ahead computation algorithms to work with strings of length at most k throughout the computation rather than with strings of unbounded length truncated to k only at the end.

Fig. 5. Look-ahead computation example. [Figure: stylized GFGs for the grammars below, with look-ahead sets shown in braces next to nodes.]

(a) FIRST_2 and FOLLOW_2 computation

Grammar:
S → yLab | yLbc | M
L → a | ε
M → MM | x

FIRST_2(S) = ({y} +_2 FIRST_2(L) +_2 {ab}) ∪ ({y} +_2 FIRST_2(L) +_2 {bc}) ∪ FIRST_2(M)
FIRST_2(L) = {a} ∪ {ε}
FIRST_2(M) = {x} ∪ (FIRST_2(M) +_2 FIRST_2(M))
Solution: FIRST_2(S) = {ya, yb, x, xx}, FIRST_2(M) = {x, xx}, FIRST_2(L) = {a, ε}

FOLLOW_2(S) = {$$}
FOLLOW_2(L) = {ab} ∪ {bc}
FOLLOW_2(M) = FOLLOW_2(S) ∪ (FIRST_2(M) +_2 FOLLOW_2(M)) ∪ FOLLOW_2(M)
Solution: FOLLOW_2(S) = {$$}, FOLLOW_2(L) = {ab, bc}, FOLLOW_2(M) = {$$, xx, x$}

(b) Partial and full 2-look-ahead cloning

Cloning non-terminal L:
S → yL1ab | yL2bc | M
L1 → a | ε
L2 → a | ε
M → MM | x

Full cloning for 2-look-ahead:
S → yL1ab | yL2bc | M1
L1 → a | ε        // L1 is [L, {ab}]
L2 → a | ε        // L2 is [L, {bc}]
M1 → M2M1 | x     // M1 is [M, {$$}]
M2 → M3M2 | x     // M2 is [M, {x$, xx}]
M3 → M3M3 | x     // M3 is [M, {xx}]

Lemma 1. Function +_k has the following properties.

(a) E +_k F = (E)_k +_k (F)_k.

(b) +_k is associative and distributes over set union.
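The operations of Definition 10 are easy to experiment with. The sketch below represents a set of strings as a Python set, with ε written as the empty string; the function names `prefix_k` and `plus_k` are chosen here for illustration. It recomputes the example above and checks both parts of Lemma 1 on it, which is a spot check on sample sets rather than a proof.

```python
def prefix_k(E, k):
    """(E)_k: the k-prefixes of the strings in E."""
    return {s[:k] for s in E}


def plus_k(E, F, k):
    """E +_k F: concatenate every pair of strings, then truncate to k symbols."""
    return {(e + f)[:k] for e in E for f in F}


E = {"", "t", "tu", "abc"}
F = {"", "x", "xy", "xya"}
G = {"q", "rs"}  # an arbitrary third set, used only for the associativity check

direct = plus_k(E, F, 2)
pre_truncated = plus_k(prefix_k(E, 2), prefix_k(F, 2), 2)           # Lemma 1(a)
left_assoc = plus_k(plus_k(E, F, 2), G, 2)                          # (E +_2 F) +_2 G
right_assoc = plus_k(E, plus_k(F, G, 2), 2)                         # E +_2 (F +_2 G)
distributed = plus_k(E, F, 2) | plus_k(E, G, 2)                     # Lemma 1(b)
```

Working with `pre_truncated` operands is what keeps every intermediate set inside the finite lattice of strings of length at most k.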

4.1 Context-independent look-ahead

FOLLOW_k(v) can be computed by exploring CR-paths from v to S′•. However, for the “bulk” problem of computing these sets for many GFG nodes, such as all entry nodes in a GFG, coordination of path explorations at different nodes can yield greater efficiency.

Although we do not use this approach directly, the GFG permits FOLLOW_k computation to be viewed as an inter-procedural backward dataflow analysis problem [14]. Dataflow facts are possible FOLLOW_k sets, which are the subsets of T^k, and which form a finite lattice under subset ordering (the empty set is the least element). For an edge e with label t, the dataflow transfer function F_e(X) is {t} +_k X (for ε edges, this reduces to the identity function, as expected). For a path Q with edges labeled t_1, ..., t_n, the composite transfer function is {t_1} +_k ({t_2} +_k ... ({t_n} +_k X)), which can be written as ({t_1} +_k {t_2} +_k ... +_k {t_n}) +_k X because +_k is associative. If we denote the k-prefix of the terminal string generated by Q by FIRST_k(Q), the composite transfer function for a path Q is FIRST_k(Q) +_k X. The confluence operator is set union. To ensure that dataflow information is propagated only along (reverse) CR-paths, it is necessary to find inter-procedural summary functions that permit look-ahead sets to be propagated directly from a return node to its matching call node. These summary functions are hard to compute for most dataflow problems, but they are easy to compute for FOLLOW_k because the lattice L is finite, the transfer functions distribute over set union, and the +_k operation is associative. For a non-terminal A, the summary function is F_A(X) = FIRST_k(A) +_k X, where FIRST_k(A) is the set of k-prefixes of terminal strings generated by balanced paths from •A to A•. The FIRST_k relation can be computed efficiently as described in Section 4.1. This permits the use of the functional approach to inter-procedural dataflow analysis [14] to solve the FOLLOW_k computation problem (the development below does not rely on any results from this framework).
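The equivalence between edge-by-edge backward propagation and the composite transfer function can be checked on a small case. In the sketch below a path is given simply as its list of edge labels, with ε edges labeled by the empty string; the label sequences are hypothetical and the function names are introduced only for this illustration.

```python
from functools import reduce


def plus_k(E, F, k):
    """E +_k F: concatenation of two sets of strings, truncated to k symbols."""
    return {(e + f)[:k] for e in E for f in F}


def edge_transfer(label, X, k):
    """F_e(X) = {t} +_k X; for an ε edge (label "") this is the identity."""
    return plus_k({label}, X, k)


def propagate_backward(labels, X, k):
    """Push the look-ahead set X backward over the edges, last edge first."""
    for label in reversed(labels):
        X = edge_transfer(label, X, k)
    return X


def first_k_of_path(labels, k):
    """The k-prefix of the terminal string generated by the path, as a singleton set."""
    return reduce(lambda acc, t: plus_k(acc, {t}, k), labels, {""})


K = 2
long_path = ["", "a", "", "", "a", "b"]   # hypothetical labels; generates aab
short_path = ["", "x"]                     # hypothetical labels; generates x
at_end = {"$$"}
mixed = {"$$", "x$"}
```

For `long_path` the path alone already supplies two terminals, so the incoming set is irrelevant and both computations give `{"aa"}`; for `short_path` one symbol is borrowed from the incoming set.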

FIRST_k computation

For Γ = <N, T, P, S>, FIRST_k(A) for A ∈ N is defined canonically as the set of k-prefixes of terminal strings derived from A [17]. This is equivalent to the following, as we show in Theorem 4.

Definition 11. Given a grammar Γ = <N, T, P, S>, a positive integer k and A ∈ N, FIRST_k(A) is the set of k-prefixes of terminal strings generated by balanced paths from •A to A•.

Following convention, we write FIRST(A) to mean FIRST_1(A).


Definition 12. FIRST_k is extended to a string u_1u_2...u_n ∈ (N ∪ T)* as follows.

FIRST_k(ε) = {ε}
FIRST_k(t ∈ T) = {t}
FIRST_k(u_1u_2...u_n) = FIRST_k(u_1) +_k ... +_k FIRST_k(u_n)

FIRST_k sets for non-terminals can be computed as the least solution of a system of equations derived from the grammar.

Algorithm 1. For Γ = <N, T, P, S> and positive integer k, let M be the finite lattice whose elements are sets of terminal strings of length at most k, ordered by containment with the empty set being the least element. The FIRST_k sets for the non-terminals are given by the least solution in M of this equational system:

∀A ∈ N. FIRST_k(A) = ⋃_{A→α} FIRST_k(α)

Figure 5(a) shows an example.
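One straightforward way to obtain the least solution is to start from empty sets and re-evaluate all equations until nothing changes. The sketch below does this for the grammar of Figure 5(a). The dictionary encoding of the grammar and the function names are assumptions made for this illustration; terminals are single characters and ε is the empty string.

```python
def plus_k(E, F, k):
    """E +_k F: concatenation of two sets of strings, truncated to k symbols."""
    return {(e + f)[:k] for e in E for f in F}


def first_of_string(symbols, first, k):
    """FIRST_k of a string of grammar symbols (Definition 12)."""
    result = {""}                      # FIRST_k(ε) = {ε}
    for sym in symbols:
        # A terminal t contributes {t}; a non-terminal contributes its current set.
        result = plus_k(result, first[sym] if sym in first else {sym}, k)
    return result


def first_k(grammar, k):
    """Least solution of the equations of Algorithm 1, by repeated re-evaluation."""
    first = {A: set() for A in grammar}             # least element of the lattice
    while True:
        new = {A: set().union(*(first_of_string(rhs, first, k) for rhs in rhss))
               for A, rhss in grammar.items()}
        if new == first:
            return first
        first = new


# Grammar of Figure 5(a): S → yLab | yLbc | M, L → a | ε, M → MM | x.
GRAMMAR = {
    "S": [("y", "L", "a", "b"), ("y", "L", "b", "c"), ("M",)],
    "L": [("a",), ()],
    "M": [("M", "M"), ("x",)],
}

FIRST2 = first_k(GRAMMAR, 2)
FIRST1 = first_k(GRAMMAR, 1)
```

Each round recomputes every set from the previous round's values, and the loop stops at the first round that changes nothing; for this grammar and k = 2 the result agrees with the solution listed in Figure 5(a).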

FOLLOW_k computation

Algorithm 2. Given an augmented grammar Γ′ = <N′, T′, P′, S′> and positive integer k, let L be the lattice whose elements are sets of terminal strings of length k, ordered by containment with the empty set being the least element. The FOLLOW_k sets for non-terminals other than S′ are given by the least solution of this equational system:

FOLLOW_k(S) = {$^k}
∀B ∈ N − {S, S′}. FOLLOW_k(B) = ⋃_{A→αBγ} FIRST_k(γ) +_k FOLLOW_k(A)

Given FOLLOW_k sets for non-terminals, FOLLOW_k sets at all GFG nodes are computed by interpolation:

FOLLOW_k(A→α•β) = FIRST_k(β) +_k FOLLOW_k(A).

Figure 5(a) shows an example. M occurs in three places on the righthand sides of the grammar productions, so the righthand side of the equation for FOLLOW_k(M) is the union of three sets: the first from S→M•, the second from M→M•M, and the third from M→MM•.
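The same fixed-point scheme applies to the FOLLOW_k equations. The sketch below takes the FIRST_2 sets from Figure 5(a) as given, fixes FOLLOW_k(S) to {$^k} as in Algorithm 2, and also implements the interpolation step for GFG nodes. The encoding and all function names are illustrative assumptions.

```python
def plus_k(E, F, k):
    """E +_k F: concatenation of two sets of strings, truncated to k symbols."""
    return {(e + f)[:k] for e in E for f in F}


def first_of_string(symbols, first, k):
    """FIRST_k of a string of grammar symbols; terminals are single characters."""
    result = {""}
    for sym in symbols:
        result = plus_k(result, first.get(sym, {sym}), k)
    return result


def follow_k(grammar, start, first, k):
    """Least solution of the equations of Algorithm 2."""
    follow = {A: set() for A in grammar}
    follow[start] = {"$" * k}
    while True:
        new = {A: set() for A in grammar}
        new[start] = {"$" * k}                      # fixed, as in Algorithm 2
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, B in enumerate(rhs):
                    if B in grammar and B != start:
                        gamma = rhs[i + 1:]         # what follows this occurrence of B
                        new[B] |= plus_k(first_of_string(gamma, first, k), follow[A], k)
        if new == follow:
            return follow
        follow = new


def follow_at_node(head, beta, first, follow, k):
    """Interpolation: FOLLOW_k(A→α•β) = FIRST_k(β) +_k FOLLOW_k(A)."""
    return plus_k(first_of_string(beta, first, k), follow[head], k)


GRAMMAR = {
    "S": [("y", "L", "a", "b"), ("y", "L", "b", "c"), ("M",)],
    "L": [("a",), ()],
    "M": [("M", "M"), ("x",)],
}
FIRST2 = {"S": {"ya", "yb", "x", "xx"}, "L": {"a", ""}, "M": {"x", "xx"}}  # Figure 5(a)
FOLLOW2 = follow_k(GRAMMAR, "S", FIRST2, 2)
```

With these sets, `follow_at_node("S", ("y", "L", "a", "b"), FIRST2, FOLLOW2, 2)` evaluates to `{"ya"}`, which is why the production S→yLab can be skipped when the remaining input starts with yb.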

Using context-independent look-ahead in the Earley parser

Some implementations of Earley’s parser use a context-independent look-ahead of one symbol at start nodes and end nodes (this is called prediction look-ahead and completion look-ahead, respectively) [6]. The practical benefit of using look-ahead in the Earley parser has been debated in the literature. The implementation of Graham et al. does not use look-ahead [7]; other studies argue that some benefits accrue from using prediction look-ahead [8]. Prediction look-ahead is implemented by modifying the START rule in Figure 4(b): the production B→β is explored only if β might produce the empty string or a string that starts with the first look-ahead symbol. For this, the following formula is added to the antecedents of the START rule:

(ε ∈ FIRST(β)) ∨ (C_j = u.tv ∧ t ∈ FIRST(β)).

Completion look-ahead requires adding the following check to the antecedents of the END rule in Figure 4(b):

(C_j = u.tv) ∧ (t ∈ FIRST(γ) ∨ (ε ∈ FIRST(γ) ∧ t ∈ FOLLOW(A))).
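Both checks are simple predicates over precomputed FIRST and FOLLOW sets. The sketch below states them as two functions; the function names and the convention that the next input symbol t is passed as a one-character string (or `None` when no symbol remains) are assumptions made for illustration, and ε is the empty string.

```python
def prediction_allowed(first_beta, t):
    """START-rule check: ε ∈ FIRST(β), or the next symbol t ∈ FIRST(β)."""
    return "" in first_beta or (t is not None and t in first_beta)


def completion_allowed(first_gamma, follow_A, t):
    """END-rule check for A→αB•γ: t ∈ FIRST(γ), or ε ∈ FIRST(γ) and t ∈ FOLLOW(A)."""
    if t is None:                      # the formula requires a next symbol to exist
        return False
    return t in first_gamma or ("" in first_gamma and t in follow_A)


# Example sets for the grammar S → yLab | yLbc | M, L → a | ε, with k = 1.
FIRST_OF_A = {"a"}        # FIRST(a), for the production L → a
FIRST_OF_EPS = {""}       # FIRST(ε), for the production L → ε
FIRST_OF_AB = {"a"}       # FIRST(ab), the γ of S → yL•ab
FOLLOW_S = {"$"}
```

With the input padded by $ as described earlier, a next symbol always exists, so the `None` case only matters for callers that do not pad.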

4.2 Context-dependent look-ahead

LL(k) and LR(k) parsers use context-dependent k-look-ahead. As one would expect, exploiting context enables a parser to rule out more paths than if it uses context-independent look-ahead. One way to implement context-dependent look-ahead for a grammar Γ is to reduce it to the problem of computing context-independent look-ahead for a related grammar Γ_c through an operation similar to procedure cloning.

In general, cloning a non-terminal A in a grammar Γ creates a new grammar in which (i) non-terminal A is replaced by some number of new non-terminals A_1, A_2, ..., A_c (c ≥ 2) with the same productions as A, and (ii) all occurrences of A in the righthand sides of productions are replaced by some A_j (1 ≤ j ≤ c). Figure 5(b) shows the result of cloning non-terminal L in the grammar of Figure 5(a) into two new non-terminals L1, L2. Cloning obviously does not change the language recognized by the grammar.

The intuitive idea behind the use of cloning to implement context-dependent look-ahead is to create a cloned grammar that has a copy of each production in Γ for each context in which that production may be invoked, so as to “de-alias” look-ahead sets. In general, it is infeasible to clone a non-terminal for every one of its calling contexts, which can be infinitely many. Fortunately, contexts with the same look-ahead set can be represented by the same clone. Therefore, the number of necessary clones is bounded by the number of possible k-look-ahead sets for a node, which is 2^(|T|^k). Since this number grows rapidly with k, cloning is practical only for small values of k, but the principle is clear.

Algorithm 3. Given an augmented grammar Γ′ = (N′, T′, P′, S′) and a positive integer k, T_k(Γ′) is the following grammar:

– Nonterminals: {S′} ∪ {[A, R] | A ∈ (N′ − S′), R ⊆ T′^k}

– Terminals: T′

– Start symbol: S′

– Productions:

• S′→α where S′→α ∈ Γ′

• all productions [A, R]→Y_1Y_2...Y_m where, for some A→X_1X_2X_3...X_m ∈ P′,

Y_i = X_i if X_i is a terminal, and

Y_i = [X_i, FIRST_k(X_{i+1}...X_m) +_k R] otherwise.

Therefore, to convert the context-dependent look-ahead problem to the context-independent problem, cloning is performed as follows. For a given k, each non-terminal A in the original grammar is replaced by a set of non-terminals [A, R], one for every R ⊆ T^k (intuitively, R will end up being the context-independent look-ahead at [A, R]• in the cloned grammar). The look-ahead R is then interpolated into each production of A to determine the new productions as shown in Algorithm 3. Figure 5(b) shows the result of full 2-look-ahead cloning of the grammar in Figure 5(a) after useless non-terminals have been removed.
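Rather than materializing a clone for every subset R and pruning afterwards, an implementation can generate only the clones that are reachable from the start symbol. The sketch below does this with a worklist, starting from the clone [S, {$^k}] and using the FIRST_2 sets of Figure 5(a). The worklist strategy, the tuple encoding `(A, frozenset(R))` of a clone and the function names are choices made for this illustration, not part of Algorithm 3.

```python
def plus_k(E, F, k):
    """E +_k F: concatenation of two sets of strings, truncated to k symbols."""
    return {(e + f)[:k] for e in E for f in F}


def first_of_string(symbols, first, k):
    """FIRST_k of a string of grammar symbols; terminals are single characters."""
    result = {""}
    for sym in symbols:
        result = plus_k(result, first.get(sym, {sym}), k)
    return result


def clone_grammar(grammar, start, first, k):
    """Productions of the clones [A, R] reachable from [start, {$^k}]."""
    root = (start, frozenset({"$" * k}))
    productions = {}
    worklist = [root]
    while worklist:
        clone = worklist.pop()
        if clone in productions:
            continue
        A, R = clone
        productions[clone] = []
        for rhs in grammar[A]:
            new_rhs = []
            for i, X in enumerate(rhs):
                if X in grammar:
                    # Y_i = [X_i, FIRST_k(X_{i+1}...X_m) +_k R]
                    ctx = frozenset(plus_k(first_of_string(rhs[i + 1:], first, k), R, k))
                    new_rhs.append((X, ctx))
                    worklist.append((X, ctx))
                else:
                    new_rhs.append(X)              # terminals are copied unchanged
            productions[clone].append(tuple(new_rhs))
    return productions


GRAMMAR = {
    "S": [("y", "L", "a", "b"), ("y", "L", "b", "c"), ("M",)],
    "L": [("a",), ()],
    "M": [("M", "M"), ("x",)],
}
FIRST2 = {"S": {"ya", "yb", "x", "xx"}, "L": {"a", ""}, "M": {"x", "xx"}}  # Figure 5(a)
CLONES = clone_grammar(GRAMMAR, "S", FIRST2, 2)
```

For this grammar the worklist produces two clones of L and three clones of M, matching the non-terminals L1, L2, M1, M2 and M3 of Figure 5(b).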

5 Related work

The connection between context-free grammars and procedure call/return in program-

ming languages was made in the early 1960’s when the ﬁrst recursive-descent parsers

were developed. The approach taken in this paper is to formulate parsing problems as

path problems in the GFG, and the procedure call/return mechanism is used only to

build intuition.

In 1970, Woods deﬁned a generalization of ﬁnite-state automata called recursive

transition networks (RTNs) [20]. Perlin deﬁnes an RTN as “..a forest of disconnected

transition networks, each identiﬁed by a nonterminal label. All other labels are terminal

labels. When, in traversing a transition network, a nonterminal label is encountered,

control recursively passes to the beginning of the correspondingly labeled transition

network. Should this labeled network be successfully traversed, on exit, control returns

back to the labeled calling node” [12]. The RTN was the ﬁrst graphical representation

of context-free grammars, and all subsequent graphical representations including the

GFG are variations on this theme. Notation similar to GFG start and end nodes was

ﬁrst introduced by Graham et al in their study of the Earley parser [7]. The RTN with

this extension is used in the ANTLR system for LL(*) grammars [10].

The key difference between RTNs and GFGs is in the interpretation of the graphical

representation. An interpretation based on a single locus of control that ﬂows between

productions is adequate for SLL(k)/LL(k)/LL(*) languages but inadequate for handling

more general grammars for which multiple paths through the GFG must be followed, so

some notion of multiple threads of control needs to be added to the basic interpretation

of the RTN. For example, Perlin models LR grammars using a chart parsing strategy in

which portions of the transition network are copied dynamically [12]. In contrast, the

GFG is a single graph, and all parsing problems are formulated as path problems in this

graph; there is no operational notion of a locus of control that is transferred between

productions. In particular, the similarity between Earley’s algorithm and the NFA sim-

ulation algorithm emerges only if parsing problems are framed as path problems in a

single graph. We note that the importance of the distinction between the two viewpoints

was highlighted by Sharir and Pnueli in their seminal work on inter-procedural dataﬂow

analysis [14].

The logic programming community has explored the notion of “parsing as deduc-

tion” [11,15,16] in which the rules of the Earley recognizer in Figure 4(b) are consid-

ered to be inference rules derived from a grammar, and recognition is viewed as the

construction of a proof that a given string is in the language generated by that grammar.

The GFG shows that this proof construction can be interpreted as constructing complete

balanced paths in a graphical representation of the grammar.

An important connection between inter-procedural dataﬂow analysis and reacha-

bility computation was made by Yannakakis [21], who introduced the notion of CFL-

paths. Given a graph with labeled edges and a context-free grammar, CFL-paths are


paths that generate strings recognized by the given context-free grammar. Therefore,

the context-free grammar is external to the graph, whereas the GFG is a direct repre-

sentation of a context-free grammar with labeled nodes (start and end nodes must be

known) and labeled edges. If node labels are erased from a GFG and CFL-paths for the

given grammar are computed, this set of paths will include all the complete balanced

paths but in general, it will also include non-CR-paths that happen to generate strings

in the language recognized by the context-free grammar.

6 Conclusions

In other work, we have shown that the GFG permits an elementary presentation of LL,

SLL, LR, SLR, and LALR grammars in terms of GFG paths. These results and the

results in this paper suggest that the GFG can be a new foundation for the study of

context-free grammars.

Acknowledgments: We would like to thank Laura Kallmeyer for pointing us to the

literature on parsing in the logic programming community, and Giorgio Satta and Lillian

Lee for useful discussions about parsing.

References

1. A. Aho, M. Lam, R. Sethi, and J. Ullman. Compilers: principles, techniques, and tools. Addison Wesley, 2007.

2. J. Aycock and N. Horspool. Practical Earley parsing. The Computer Journal, 45(6):620–630,

2002.

3. W. D. Clinger and C. Halpern. Alternative semantics for McCarthy’s amb. In Seminar on Concurrency, Carnegie-Mellon University, pages 467–478, London, UK, 1985. Springer-Verlag.

4. T. Cormen, C. Leiserson, R. Rivest, and C. Stein, editors. Introduction to Algorithms. MIT

Press, 2001.

5. P. Cousot and R. Cousot. Parsing as abstract interpretation of grammar semantics. Theoret.

Comput. Sci., 290:531–544, 2003.

6. J. Earley. An efﬁcient context-free parsing algorithm. Commun. ACM, 13(2):94–102, Feb.

1970.

7. S. L. Graham, W. L. Ruzzo, and M. Harrison. An improved context-free recognizer. ACM

TOPLAS, 2(3):415–462, July 1980.

8. D. Grune and C. Jacobs. Parsing Techniques: a practical guide. Springer-Verlag, 2010.

9. J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages,

and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston,

MA, USA, 2006.

10. T. Parr and K. Fisher. LL(*): the foundation of the ANTLR parser generator. In PLDI, 2011.

11. F. C. N. Pereira and D. Warren. Parsing as deduction. In 21st Annual Meeting of the As-

sociation for Computational Linguistics, pages 137–144, MIT, Cambridge, Massachusetts,

1983.

12. M. Perlin. LR recursive transition networks for Earley and Tomita parsing. In ACL ’91,

1991.

13. J. C. Reynolds. On the relation between direct and continuation semantics. In Proceedings of

the 2nd Colloquium on Automata, Languages and Programming, pages 141–156. Springer-

Verlag, 1974.


14. M. Sharir and A. Pnueli. Program ﬂow analysis: theory and applications, chapter Two

approaches to interprocedural dataﬂow analysis, pages 189–234. Prentice-Hall, 1981.

15. S. M. Shieber, Y. Schabes, and F. C. N. Pereira. Principles and implementation of deductive

parsing. Journal of Logic Programming, 24(1 and 2):3–36, 1995.

16. K. Sikkel. Parsing Schemata. Texts in Theoretical Computer Science. Springer-Verlag,

Berlin, Heidelberg, New York, 1997.

17. S. Sippu and E. Soisalon-Soininen. Parsing theory. Springer-Verlag, 1988.

18. G. Sussman and G. Steele. Scheme: An interpreter for extended lambda calculus. Technical

Report AI Memo 349, AI Lab, M.I.T., December 1975.

19. Terese. Term Rewriting Systems. Cambridge University Press, 2003.

20. W. A. Woods. Transition network grammars for natural language analysis. Commun. ACM,

13(10), Oct. 1970.

21. M. Yannakakis. Graph-theoretic methods in database theory. In Principles of Database

Systems, 1990.

A Appendix

A.1 Derivations, parse trees and GFG paths

The following result connects complete balanced paths to parse trees.

Theorem 4. Let Γ = <N, T, P, S> be a context-free grammar and G = GFG(Γ) the corresponding grammar flow graph. Let A ∈ N. There exists a balanced path from •A to A• with n_cr call-return pairs that generates a string w ∈ T* if and only if there exists a parse tree for w with n_int = n_cr + 1 internal nodes.

Proof. We proceed by induction on n_cr. The base case, n_cr = 0, arises for a production A→u_1u_2...u_r where each u_j is a terminal. The GFG balanced path contains the sequence of nodes

•A, A→•u_1u_2...u_r, ..., A→u_1u_2...u_r•, A•

The corresponding parse tree has a root with label A and r children respectively labeled u_1, u_2, ..., u_r (from left to right), with n_int = 1 internal node. The string generated by the path and derived from the tree is w = u_1u_2...u_r.

Assume now inductively the stated property for paths with fewer than n_cr call-return pairs and trees with fewer than n_int internal nodes. Let Q be a path from •A to A• with n_cr call-return pairs. Let A→u_1u_2...u_r be the “top production” used by Q, i.e., the second node on the path is A→•u_1u_2...u_r. If u_j ∈ N, then Q will contain a segment of the form

A→u_1...u_{j−1}•u_j...u_r, Q_j, A→u_1...u_j•u_{j+1}...u_r

where Q_j is a balanced path from •u_j to u_j•, generating some word w_j. Let T_j be a parse tree for w_j with root labeled u_j, whose existence follows by the inductive hypothesis. If instead u_j ∈ T, then Q will contain the scan edge

(A→u_1...u_{j−1}•u_j...u_r, A→u_1...u_j•u_{j+1}...u_r)

generating the word w_j = u_j. Let T_j be a tree with a single node labeled w_j = u_j. The word generated by Q is w = w_1w_2...w_r. Clearly, the tree T with a root labeled A and r subtrees equal (from left to right) to T_1, T_2, ..., T_r derives string w. Finally, it is simple to show that T has n_int = n_cr + 1 internal nodes.

The construction of a balanced path generating w from a tree deriving w follows the same structure.

A.2 Correctness and complexity of Earley’s algorithm

The following result is an “inductive version” of Theorem 2, which asserts the correctness of the rules for the Earley parser.

Theorem 5. Consider the execution of Earley’s algorithm on input string w = a_1a_2...a_n. Let z be a GFG node and i and j be integers such that 0 ≤ i ≤ j ≤ n. The following two properties are equivalent.

(A) The algorithm creates an entry <z, i> in Σ_j.

(B) There is a CR-path Q = (•S)Q′z (represented as a sequence of GFG nodes beginning at •S and ending at z) that generates a_1a_2...a_j and whose prefix preceding the last unmatched call edge generates a_1a_2...a_i.

Proof. Intuitively, the key fact is that each rule of Earley’s algorithm (aside from initialization) uses an entry <y, i′> ∈ Σ_{j′} and a GFG edge (y, z) to create an entry <z, i> ∈ Σ_j, where the dependence of i and j upon i′ and j′ depends on the type of edge (y, z). For a return edge, a suitable entry <z′, k> ∈ Σ_{i′} is also consulted. In essence, if a CR-path can be extended by an edge, then (and only then) the appropriate rule creates the entry for the extended path. The formal proof is an inductive formulation of this intuition and carries out a case analysis with respect to the type of edge that extends the path.

Part I. B ⇒ A (from CR-path to Earley entry). The argument proceeds by induction on the length (number of edges) ℓ of path Q.

- Base cases (ℓ = 0, 1).

The only path with no edges is Q = (•S), for which i = j = 0. The INIT rule produces the corresponding entry <•S, 0> ∈ Σ_0. The paths with just one edge are also easily dealt with, as they are of the form Q = (•S)(S→•σ), that is, they contain one ENTRY edge.

- Inductive step (from ℓ−1 ≥ 1 to ℓ).

Consider a CR-path Q = (•S)Ryz of length ℓ. It is straightforward to check that Q′ = (•S)Ry is also a CR-path, of length ℓ−1. Hence, by the inductive hypothesis, an entry <y, i′> is created by the algorithm in some Σ_{j′}, with Q′ generating a_1a_2...a_{j′} and with the prefix of Q′ preceding its last unmatched call edge generating a_1a_2...a_{i′}.

Inspection of the rules for the Earley parser in Figure 4 reveals that, given <y, i′> ∈ Σ_{j′} and given the presence of edge (y, z) in the CR-path Q, an entry <z, i> ∈ Σ_j is always created by the algorithm. It remains to show that i and j have, with respect to path Q, the relationship stated in property (B).


- Frame number j. We observe that the string of terminals generated by Q is the same as the string generated by Q′, except when (y, z) is a scan edge, in which case Q generates a_1a_2...a_{j′+1}. Correspondingly, the algorithm sets j = j′, except when (y, z) is a scan edge, in which case it sets j = j′ + 1.

- Tag i. We distinguish three cases, based on the type of edge.

– When (y, z) is an entry, scan, or exit edge, Q has the same last unmatched call edge as Q′. Correspondingly, i = i′.

– When (y, z) is a call edge, then (y, z) is the last unmatched call edge on Q. The algorithm correctly sets i = j′ = j.

– Finally, let (y, z) be a return edge, with y = B• and z = A→αB•γ. Since Q is a CR-path, (y, z) must match the last unmatched call edge in Q′, say, (z′, y′), with z′ = A→α•Bγ and y′ = •B. We can then write Q = (•S)Q_1z′y′Q_2yz where Q_2 is balanced, whence Q and (•S)Q_1z′ have the same last unmatched call edge, say (u, v). Let i′ be such that the prefix of Q ending at z′ generates a_1a_2...a_{i′} and let k ≤ i′ be such that the prefix of Q ending at u generates a_1a_2...a_k. By the inductive hypothesis, corresponding to path (•S)Q_1z′, the algorithm will have created entry <z′ = A→α•Bγ, k> ∈ Σ_{i′}. From entries <y = B•, i′> ∈ Σ_{j′} and <z′ = A→α•Bγ, k> ∈ Σ_{i′}, as well as from return edge (y, z), the END rule of the algorithm, as written in Figure 4, creates <z = A→αB•γ, i = k> ∈ Σ_j. The tag k is the one required by property (B), since (u, v) is the last unmatched call edge of Q.

Part II. A ⇒ B (from Earley entry to CR-path). The argument proceeds by induction on the number q of rule applications executed by the algorithm when entry <z, i> is first added to Σ_j. (Further “discoveries” that <z, i> ∈ Σ_j are possible, but the entry is added only once.)

- Base case (q = 1). The only rule applicable at first is INIT, creating the entry <•S, 0> ∈ Σ_0, whose corresponding path is clearly Q = (•S).

- Inductive step (from q−1 ≥ 1 to q). Let the q-th rule application of the algorithm be based on GFG edge (y, z) and on entry <y, i′> ∈ Σ_{j′}. Also let <z, i> ∈ Σ_j be the entry created by the algorithm as a result of said rule application. By the inductive hypothesis, there is a CR-path (•S)Q′y generating a_1a_2...a_{j′} and with the prefix of Q′ preceding its last unmatched call edge generating a_1a_2...a_{i′}. To show that to entry <z, i> ∈ Σ_j there corresponds a CR-path Q as in (B), we consider two cases, based on the type of edge (y, z).

– When (y, z) is an entry, scan, exit or call edge, we consider the path Q = (•S)Q′yz. Arguments symmetric to those employed in Part I of the proof show that the path Q does satisfy property (B), with exactly the values i and j of the entry <z, i> ∈ Σ_j produced by the algorithm.

– When (y, z) is a return edge, the identification of path Q requires more care. Let y = B• and z = A→αB•γ. The END rule of Earley’s algorithm creates entry <z, i> ∈ Σ_j based on two previously created entries to each of which, by the inductive hypothesis, there corresponds a path, as discussed next.

To entry <y = B•, k> ∈ Σ_j there corresponds a CR-path of the form Q′ = (•S)Q′_1x′y′Q_2y, with last unmatched call edge (x′, y′), where y′ = •B and Q_2 is balanced.

To entry <z′ = A→α•Bγ, i> ∈ Σ_k there corresponds a CR-path of the form Q″ = (•S)Q_1z′, where z′ = A→α•Bγ.

From the above two paths, as well as from return edge (y, z), we can form a third CR-path Q = (•S)Q_1z′y′Q_2yz. We observe that it is legitimate to concatenate (•S)Q_1z′ with y′Q_2y via the call edge (z′, y′) since y′Q_2y is balanced. It is also legitimate to append return edge (y, z) to (•S)Q_1z′y′Q_2y (thus obtaining Q), since such edge does match (z′, y′), the last unmatched call edge of said path.

It is finally straightforward to check that the frame number j and the tag i are appropriate for Q.

Proof of Theorem 3. For a given GFG G = (V, E) and input word w, Earley’s algorithm requires O(|w|^2) space and O(|w|^3) time. If the grammar is unambiguous, the time complexity is reduced to O(|w|^2).

Proof. – Space complexity: There are |w| + 1 Σ-sets, and each Σ-set can have at most |V|(|w| + 1) elements since there are |w| + 1 possible tags. Therefore, the space complexity of the algorithm is O(|w|^2).

– Time complexity: For the time complexity, we need to estimate the number of distinct rule instances that can be invoked and the time to execute each one (intuitively, the number of times each rule can “fire” and the cost of each firing).

For the time to execute each rule instance, we note that the only non-trivial rule is the end rule: when <B•, k> is added to Σ_j, we must look up Σ_k to find entries of the form <A→α•Bγ, i>. To permit this search to be done in constant time per entry, we maintain a data structure with each Σ-set, indexed by a non-terminal, which returns the list of such entries for that non-terminal. Therefore, all rule instances can be executed in constant time per instance.

We now compute an upper bound on the number of distinct rule instances for each rule schema. The init rule schema has only one instance. The start rule schema has two parameters: the particular start node in the GFG at which this rule schema is being applied and the tag j, and it can be applied for each outgoing edge of that start node, so the number of instances of this rule is O(|V| ∗ |V| ∗ |w|); for a given GFG, this is O(|w|).

Similarly, the end rule schema has four parameters: the particular end node in the GFG, and the values of i, j, k; the relevant return node is determined by these parameters. Therefore, an upper bound on the number of instances of this schema is O(|V||w|^3), which is O(|w|^3) for a given GFG.

A similar argument shows that the complexity of call, exit and scan rule schema instances is O(|w|^2).

Therefore the complexity of the overall algorithm is O(|w|^3).

– Unambiguous grammar: As shown above, the cubic complexity of Earley’s algorithm arises from the end rule. Consider the consequent of the end rule. The proof of Theorem 2 shows that <A→αB•γ, i> ∈ Σ_j iff w[i..(j−1)] can be derived from αB. If the grammar is unambiguous, there can be only one such derivation; considering the antecedents of the end rule, this means that for a given return node A→αB•γ and given values of i and j, there can be exactly one k for which the antecedents of the end rule are true. Therefore, for an unambiguous grammar, the end rule schema can be instantiated at most O(|w|^2) times for a given grammar. Since all other rules are bounded above similarly, we conclude that Earley’s algorithm runs in time O(|w|^2) for an unambiguous grammar.
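The per-Σ-set index used in the time-complexity argument can be sketched as follows. The class name `SigmaSet`, its methods and the tuple layout of an entry (head, right-hand side, dot position, tag) are hypothetical; the only point illustrated is that the entries waiting at a call node for a given non-terminal are retrieved without scanning the whole set.

```python
from collections import defaultdict


class SigmaSet:
    """One Σ-set, with an index from a non-terminal B to the entries <A→α•Bγ, i>."""

    def __init__(self, nonterminals):
        self.nonterminals = nonterminals
        self.entries = set()
        self.waiting = defaultdict(list)

    def add(self, head, rhs, dot, tag):
        """Insert an entry once; return False if it was already present."""
        entry = (head, rhs, dot, tag)
        if entry in self.entries:
            return False
        self.entries.add(entry)
        if dot < len(rhs) and rhs[dot] in self.nonterminals:
            # The dot is in front of a non-terminal: remember the entry under that symbol.
            self.waiting[rhs[dot]].append(entry)
        return True

    def callers_of(self, nonterminal):
        """Entries to advance when the END rule fires for `nonterminal`."""
        return self.waiting.get(nonterminal, [])


sigma = SigmaSet({"S", "L", "M"})
sigma.add("S", ("y", "L", "a", "b"), 1, 0)      # S→y•Lab with tag 0
sigma.add("S", ("y", "L", "b", "c"), 1, 0)      # S→y•Lbc with tag 0
sigma.add("L", ("a",), 0, 1)                    # L→•a with tag 1 (dot before a terminal)
duplicate_added = sigma.add("S", ("y", "L", "a", "b"), 1, 0)
```

When <L•, 1> is later added to some Σ_j, `sigma.callers_of("L")` (with `sigma` playing the role of Σ_1) returns exactly the two entries to be advanced, so the cost of the END rule is proportional to the number of entries it produces.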

A.3 Look-ahead computation

Proof of correctness of Algorithm 1:

Proof. The system of equations can be solved using Jacobi iteration, with FIRST_k(A) = {} (the empty set, the least element of M) as the initial approximation for each A ∈ N. If the sequence of approximate solutions for the system is X_0; X_1; ..., the set X_i[A] (i ≥ 1) contains the k-prefixes of terminal strings generated by balanced paths from •A to A• in which the number of call-return pairs is at most (i−1). Termination follows from monotonicity of set union and +_k, and finiteness of M.

Proof of correctness of Algorithm 2:

Proof. The system of equations can be solved using Jacobi iteration. If the sequence of approximate solutions is X_0; X_1; ..., then X_i[B] (i ≥ 1) contains the k-prefixes of terminal strings generated by CR-paths from B• to S′• in which there are i or fewer unmatched return nodes.