A New Top-Down Parsing Algorithm to Accommodate
Ambiguity and Left Recursion in Polynomial Time
Richard A. Frost and Rahmatullah Hafiz
School of Computer Science, University of Windsor
401 Sunset Avenue, Windsor, Ontario Canada ON N9B3P4
rfrost@cogeco.ca
ABSTRACT
Top-down backtracking language processors are highly
modular, can handle ambiguity, and are easy to implement
with clear and maintainable code. However, a widely-held,
and incorrect, view is that top-down processors are in-
herently exponential for ambiguous grammars and cannot
accommodate left-recursive productions. It has been known
for many years that exponential complexity can be avoided
by memoization, and that left-recursive productions can be
accommodated through a variety of techniques. However,
until now, memoization and techniques for handling left
recursion have either been presented independently, or else
attempts at their integration have compromised modularity
and clarity of the code.
Categories and Subject Descriptors
D.3.3 [Programming Languages]: Language Constructs
and Features — Control structures; Processors — Parsing;
F.4.2 [Mathematical Logic and Formal Languages]:
Grammars and Other Rewriting Systems - grammar
types; parsing; I.2.7 [Artificial Intelligence]: Natural
Language Processing - language models; language parsing
and understanding
Keywords
Top-down parsing, left-recursion, memoization, backtrack-
ing, parser combinators.
1 INTRODUCTION
Top-down backtracking language processors have a num-
ber of advantages compared to other methods: 1) they are
general and can be used to implement ambiguous gram-
mars, 2) they are easy to implement in any language which
supports recursion. Associating semantic rules with the recursive
functions that implement the syntactic production rules
of the grammar is straightforward, 3) they are highly
modular (Koskimies [8]) and components can be tested
independently and easily reused, 4) the structure of the code
is closely related to the structure of the grammar of the lan-
guage to be processed, and 5) in functional programming,
higher-order functions, called parser combinators, can be
defined so that language processors can be implemented as
executable specifications of grammars, and in Logic Pro-
gramming, Definite Clause Grammars (DCGs) can be used
to the same effect.
However, a naive implementation of top-down processing
results in the processor repeating much of its work. In
the worst case, this results in exponential complexity for
highly-ambiguous grammars, even for recognition which is
known to be polynomial. In addition, a naive implemen-
tation cannot accommodate left-recursive grammar produc-
tions as the top-down search would result in infinite descent.
This paper contains an informal description of a new
top-down parsing algorithm which accommodates ambigu-
ity and left recursion in polynomial time. The algorithm
is described using the notation of set theory, and informal
proofs of termination and complexity are provided. Results
of an implementation in the programming language Haskell are
presented, more details of which are available in a techni-
cal report available from the School of Computer Science
at the University of Windsor [2]. A formal description of
the algorithm together with more-detailed proofs of partial
correctness, termination and complexity is in preparation.
1.2 Misconceptions
For many years it was assumed that the exponential com-
plexity of top-down recognition of ambiguous sentences
was inevitable. However, in 1991, Norvig [13] showed
that polynomial complexity for top-down recognizers, built
in LISP, could be achieved by use of memoization in which
the results of each step of the process are stored in a memo
table and made use of by subsequent steps.
It is also widely believed that top-down language pro-
cessors cannot accommodate left-recursive productions.
Using the search terms "top-down" and "left-recursion" on
Google returns over 14,000 hits. Review of the results
shows the extent to which it continues to be assumed that
left-recursion must be eliminated (by rewriting the gram-
mars) before top-down processing can be used. However,
ACM SIGPLAN Notices 46 Vol. 41 (5), May 2006
such rewriting is not strictly necessary, and several re-
searchers have proposed ways in which left-recursion can
be accommodated:
1. Kuno [9] appears to have been the first to use the length
of the input to force termination of left-recursive de-
scent in top-down processing. The minimal lengths of
the strings generated by the grammar on the continu-
ation stack are added and when their sum exceeds the
length of the remaining input, expansion of the current
non-terminal is terminated. However, Kuno’s method
is exponential in the worst case.
2. Shiel [14] noticed the relationship between top-down
and Chart parsing and developed an approach in which
procedures corresponding to non-terminals are called
with an extra parameter indicating how many terminals
they should read from the input. When a procedure
corresponding to a rule defining a non-terminal
n is applied, the value of this extra parameter is partitioned
into smaller values which are then passed to the
component procedures on the right of the rule defining
n. The processor backtracks when a procedure defin-
ing a non-terminal is applied with the same parameter
to the same input position. The method terminates for
left-recursion but is exponential in the worst case.
3. Leermakers [10] has developed a functional approach
to memoized parsing which avoids the left-recursion
problem through "recursive ascent" rather than a top-
down search process. Although maintaining polyno-
mial complexity, the approach compromises modular-
ity and clarity of the code.
4. In earlier work, one of the authors of this paper no-
ticed that rewriting left-recursive recognizers to non-
left-recursive form is relatively simple but that rewrit-
ing attributed grammars (which contain semantic as
well as syntactic rules) can be very difficult. To avoid
this difficulty, a method was developed in which non-
left-recursive recognizers are used as guards to pre-
vent non-termination of the left-recursive executable
attribute grammars which they guard [4]. However,
the method is exponential in the worst case.
5. Nederhof and Koster [12] have developed a method
called "cancellation" parsing in which grammar rules
are translated into DCG rules such that each DCG
non-terminal is given a "cancellation set" as an extra
argument. Every time that a new non-terminal is
derived in the expansion of a rule, this non-terminal
is added to the cancellation set and the resulting set
is passed on to the next symbol in the expansion. If
a non-terminal is derived which is already in the set
then the parser backtracks. This technique prevents
non-termination of left-recursion. However, by itself,
it would miss certain parses. Therefore, the method
also requires that for each non-terminal N which has
a left-recursive alternative: 1) a function is added to the
parser which places a special token N̄ at the front of
the input to be recognized, 2) a DCG corresponding
to the rule N ::= N̄ is added to the parser, and 3) the
new DCG is invoked after the left-recursive DCG has
been called. The approach accommodates explicit left-
recursion and maintains modularity. An extension to
it also accommodates hidden left recursion which can
occur when the grammar contains rules with empty
right-hand sides. The shortcoming of Nederhof and
Koster’s approach is that it is exponential in the worst
case and that the resulting code is less clear as it
contains additional production rules and code to insert
the special tokens.
6. Lickman [11] has developed a technique by which pure
functional monadic parser combinators can be modi-
fied to accommodate left recursion. The method is
based on an idea put forward by Wadler in an un-
published paper in which he claimed that fixed points
could be used to accommodate left recursion. Lick-
man fleshes out Wadler’s idea by providing a formal
mathematical justification of termination. The method
involves constructing a fixed-point combinator for the
set monad and then using this function to build an
efficient fixed-point combinator for the parser monad
(again based on an idea by Wadler). Lickman has also
developed a program which automatically generates
parsers in the pure functional programming language
Haskell from the BNF specification of the grammar.
The method accommodates left recursion whilst main-
taining modularity and clarity of the code. However,
it has exponential complexity.
7. Johnson [6] has developed a method by which mem-
oized top-down parser combinators can accommodate
left recursion in the impure-functional programming
language Scheme. The basic idea is to use the continuation-passing
style (CPS) of programming so that the
parser computes multiple results, for ambiguous cases,
incrementally. Johnson demonstrates how CPS can
be integrated with memoization so that polynomial
complexity and termination with left recursion can be
achieved with top-down parsing. Surprisingly, John-
son’s paper has not been widely cited and his approach
does not appear to have been used by others. One ex-
planation for this could be that the approach is some-
what convoluted and extending it to return packed rep-
resentations of parse trees, as in Tomita’s Chart parser
[15], could be too complicated.
8. Camarao, Figueiredo, and Oliveiro [1] claim to have
built a monadic combinator compiler generator called
Mimico which accommodates left recursion. However,
it does not handle ambiguous grammars.
1.3 Left-Recursion?
There are two reasons why we want to implement left-
recursive grammars: firstly, it is often easier to add attribute
computations to language processors that implement left-
recursive grammars. As a trivial, but illustrative example,
consider a processor which converts numbers represented
as character strings to their values. The left-recursive
formulation is as follows:
number ::= digit
number.VAL = digit.VAL
| number’ digit
number.VAL = (10 * number’.VAL
+ digit.VAL)
digit ::= ’0’
digit.VAL = 0
| ’1’
digit.VAL = 1 etc.
The right-recursive formulation is more complex and
requires an additional attribute.
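The contrast can be sketched in Haskell. This is only an illustration of the attribute computation, not of the parsing algorithm: the names valueLeft and valueRight are hypothetical, both functions assume the string contains only digit characters, and the empty string is used as the base case for brevity. The extra attribute chosen for the right-recursive version (the number of digits already covered) is one possible choice.

```haskell
import Data.Char (digitToInt)

-- Left-recursive reading (number ::= number' digit): the value of the
-- prefix is already known when the next digit arrives.
valueLeft :: String -> Int
valueLeft = foldl (\acc d -> 10 * acc + digitToInt d) 0

-- Right-recursive reading (number ::= digit number'): each level must
-- also report how many digits it covered (the additional attribute),
-- so that the leading digit can be scaled correctly.
valueRight :: String -> (Int, Int)   -- (value, number of digits)
valueRight []       = (0, 0)
valueRight (d : ds) = (digitToInt d * 10 ^ len + val, len + 1)
  where
    (val, len) = valueRight ds
```

For the string "1024", valueLeft returns 1024 and valueRight returns (1024, 4).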
Secondly, and perhaps more importantly, the advan-
tages of top-down backtracking parsers make them ide-
ally suited for the investigation of compositional theo-
ries of natural language. Such investigation is necessary
in order to provide more-powerful natural-language inter-
faces than are currently available. For example, although
some NL interfaces can handle various constructs contain-
ing transitive verbs, the processing of verb adjuncts is still
very limited (e.g., “When and with what did Hall discover
Phobos?”). There is no widely-accepted linguistic theory
which accounts for verb adjuncts. Ideally a compositional
Montague-like theory will be developed but this will require
an environment in which variations of grammars and se-
mantic rules can be investigated. Because natural language
is inherently ambiguous, and both leftmost and rightmost
parses are required, the accommodation of left-recursive
grammars will facilitate such investigation.
1.4 Overview
The goal of this research is to develop a method by which
top-down parsers can accommodate ambiguity and left re-
cursive grammars and be efficient enough for prototyping
natural-language processors whilst maintaining modularity
and clarity of code. None of the approaches referred to earlier
have achieved all of these objectives. However, they
have shed light on the problem and the solution that we
have developed owes much to this earlier work.
The new algorithm uses memoization to improve com-
plexity in a manner similar to that proposed by Norvig [13].
The new idea, introduced for the first time in this paper, is
to integrate a bound into the memoization process, which
is used to fail a parse branch when that branch contains a
cycle introduced through left recursion. This is similar to
the approaches proposed by Kuno [9], Shiel [14] and Lick-
man [11]. However, the new approach allows left-recursive
productions to be accommodated whilst achieving polyno-
mial complexity and preserving the modularity and clarity
of the processors.
The memotable that is created during the parsing process
contains much of the information that is required to con-
struct the potentially exponential number of parse trees. We
show how more information can be gathered by memoizing
parsers which correspond to each alternative right-hand side
of the productions in the grammar. The result is a useful
polynomial-sized compact representation of the parse trees.
2 TOP-DOWN PARSING/RECOGNITION
We describe top-down backtracking parsing from a perspec-
tive that corresponds to the construction of such parsers as
recursive-descent processors. For simplicity, we begin by
describing the algorithm with respect to recognizers, and
discuss parsers later.
We assume that the input is a sequence of tokens input,
of length input#, the members of which are accessed
through an index j. Irrespective of the programming lan-
guage used, the recursive-descent approach can be thought
of as requiring a recognizer to be built for each terminal of
the grammar, and the subsequent combination of these and
other recognizers to build recognizers for the non-terminals
of the grammar. For ambiguous grammars, the recognizers
can be thought of as functions which take an index j as
argument and which return a set of indices as result. Each
index in the result set corresponds to the position at which
the recognizer finished successfully recognizing a sequence
of tokens that began at position j. An empty result set in-
dicates that the recognizer failed to recognize any sequence
beginning at j.
As a running example, we consider a recognizer corresponding
to the grammar sS ::= ’s’ sS sS | empty
and input = “ssss”. We have chosen to use this example
grammar throughout the paper as it is highly ambiguous.
According to Aho and Ullman, sS generates
$\frac{1}{n+1}\binom{2n}{n}$
different leftmost parses of strings consisting of n ’s’s. For
example, for n = 16 there are over 35 million parses.
Although natural language is not as ambiguous as this, large
numbers of parses can be generated during lexical analysis.
We give a natural-language example later.
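The size of this ambiguity can be checked with a short calculation. The sketch below assumes that the number of parses of n 's' tokens is the nth Catalan number, which agrees with the figure quoted for n = 16; the names parseCount and binomial are hypothetical.

```haskell
-- Number of distinct parses of a string of n 's' tokens under
-- sS ::= 's' sS sS | empty, taken to be the nth Catalan number.
parseCount :: Integer -> Integer
parseCount n = binomial (2 * n) n `div` (n + 1)

-- Binomial coefficient "m choose k" for 0 <= k <= m.
binomial :: Integer -> Integer -> Integer
binomial m k = product [m - k + 1 .. m] `div` product [1 .. k]
```

Here parseCount 4 is 14 and parseCount 16 is 35357670.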
2.1 Recognizers for single tokens
A recognizer term_t for a single terminal t of the grammar
takes an index j as input. If j is greater than the length of
the input, the recognizer returns an empty set. Otherwise,
it checks to see if the token at position j in the input
corresponds to the terminal t. If so, it returns a singleton
set containing j + 1, otherwise it returns the empty set.
For example, a basic recognizer for the terminal ’s’ is
defined as follows:
term_s j = {} , if j > length of input
= {j + 1}, if input!j = ’s’
= {} , otherwise
2.2 Empty
The empty recognizer always succeeds and returns its input
index in a singleton set as result.
empty j = {j}
2.3 Alternate recognizers
A recognizer corresponding to a construct p | q in the
grammar is built by combining recognizers for p and q.
When the composite recognizer is applied to an index j, it
first applies p to j, then applies q to j, and then unites the
results. We introduce the operator orelse to denote the
process of combining alternate recognizers. This operator
can be implemented in various ways depending on the
programming language used.
(p orelse q) j = (p j) ∪ (q j)
For example, assuming that the input is “ssss”
(empty orelse term_s) 2 => {2, 3}
2.4 Sequence recognizers
A recognizer corresponding to a construct p q in the grammar
is built by combining recognizers for p and q. When
the composite recognizer is applied to an index j, it first
applies p to j, then applies q to each index in the set of
the results returned by p. It returns the union of each of
these applications of q. We introduce the operator then to
denote the process of sequencing recognizers.
(p then q) j = ∪ (map q (p j))
For example, assuming that the input is “ssss”
(term_s then term_s) 1 => {3}
2.5 An example recognizer
The operators orelse and then can be used to define
recognizers through recursion and mutual recursion. For
example, the following recognizer sS corresponds to the
example grammar sS ::= ’s’ sS sS | empty:
sS = (term_s then sS then sS)
orelse empty
Assuming that the input is “ssss”, the recognizer sS
returns a set of 5 results. The first 4 results correspond to
proper prefixes of the input being recognized as an sS. The
result 5 corresponds to the case where the whole input is
recognized as an sS.
sS 1 => {1, 2, 3, 4, 5}
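The recognizers and operators of sections 2.1 to 2.5 can be sketched in Haskell as follows. This is a minimal illustration rather than the implementation reported later: the input is fixed as a global string, the sequencing operator is named thenR because then is a reserved word in Haskell, and the type names Pos and Recognizer are assumptions.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

type Pos        = Int
type Recognizer = Pos -> Set Pos

-- The input is fixed to keep the sketch short; positions are 1-based.
input :: String
input = "ssss"

-- Recognizer for a single terminal character (section 2.1).
term :: Char -> Recognizer
term c j
  | j > length input      = Set.empty
  | input !! (j - 1) == c = Set.singleton (j + 1)
  | otherwise             = Set.empty

-- The empty recognizer always succeeds at its start index (section 2.2).
empty :: Recognizer
empty j = Set.singleton j

-- Alternation: unite the results of both alternatives (section 2.3).
orelse :: Recognizer -> Recognizer -> Recognizer
(p `orelse` q) j = p j `Set.union` q j

-- Sequencing: apply q at every end position of p and unite the
-- results (section 2.4).
thenR :: Recognizer -> Recognizer -> Recognizer
(p `thenR` q) j = Set.unions (map q (Set.toList (p j)))

-- sS ::= 's' sS sS | empty (section 2.5).
sS :: Recognizer
sS = (term 's' `thenR` sS `thenR` sS) `orelse` empty
```

With this input, (empty `orelse` term 's') 2 evaluates to the set {2, 3}, (term 's' `thenR` term 's') 1 to {3}, and sS 1 to {1, 2, 3, 4, 5}.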
2.6 Limitations of the approach
described so far
The reader may have noticed that the number of entries in
the output list of the recognizer is less than the number of
possible parses. This is owing to the fact that the results
generated during the process are united. Although this re-
duces the amount of work done in recognition, the process
still has exponential time complexity with respect to the
length of the input. This is because recognizers may be
repeatedly applied to the same index during the backtracking
process which is induced by the operator orelse. In
section 3, we show how Norvig’s method can be used to
achieve polynomial complexity by “memoizing” the recognition
functions so that they reuse previously-computed
results.
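The repeated work can be made visible by counting how often the unmemoized recognizer is applied. The sketch below is a simplified model, not the authors' measurement: it hard-codes the grammar sS ::= 's' sS sS | empty for an input of n 's' tokens, counts only applications of sS itself (not of the terminal recognizer), and the name applications is hypothetical.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

-- Unmemoized sS over an input of n 's' tokens, applied at index j.
-- Returns the set of end positions together with the number of times
-- sS was applied, including this application.
sS :: Int -> Int -> (Set Int, Integer)
sS n j
  | j > n     = (Set.singleton j, 1)   -- no token left: only empty applies
  | otherwise = (Set.insert j ends, 1 + c1 + c2)
  where
    (mids, c1) = applyAll (Set.singleton (j + 1))  -- first sS, after the 's'
    (ends, c2) = applyAll mids                     -- second sS
    applyAll ps =
      let rs = map (sS n) (Set.toList ps)
      in (Set.unions (map fst rs), sum (map snd rs))

-- Total number of applications of sS when recognizing n tokens from index 1.
applications :: Int -> Integer
applications n = snd (sS n 1)
```

Under this model, applications 4 is 55 and applications 8 is 2584, while the result set for n = 4 still contains only the five indices 1 to 5.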
The second limitation is that the approach cannot be
used to build recognizers that correspond directly to left
recursive grammars, that is, grammars in which a non-terminal
p derives the expression p .... Application of the
corresponding recognizers would result in infinite descent.
We show how to avoid this problem, without having to
transform the grammars, in section 4.
We conclude this section by noting that we have tried to
make the above description of top-down parsing indepen-
dent of programming-language or paradigm. However, our
formalism is influenced by the “parser combinator” style
which has been developed by the functional-programming
community. See for example Hutton [5], Koopman and
Plasmeijer [7], and Wadler [16].
3 MEMOIZATION
Norvig [13] has shown how the worst-case complexity of
top-down recognition can be improved from exponential to
cubic through a process of memoization. The basic idea
is that a memotable is constructed during the recognition
process. At the beginning of the process the table is empty.
During the process it is updated with an entry for each
recognizer ri that is applied. The entry consists of a set of
pairs: the first component of each pair is an index j at which
the recognizer ri has been applied, and the second component
is the set of results of the application of ri to j.
The memotable is used as follows: whenever a recognizer
ri is about to be applied to an index j, the memotable
is checked to see if that recognizer has ever been applied
to that index before. If so, the results from the memotable
are returned. If not, the recognizer is applied to the in-
dex and the memotable is updated with those results before
they are returned by the recognizer. For non-left-recursive
recognizers, this process ensures that no recognizer is ever
applied to the same index more than once.
One method of implementing memoization, that was
suggested by Norvig, is to have a global memotable, and
to encapsulate the recognizers which are to be memoized
in a function which performs the memotable lookup and
update. In general the process can be implemented in var-
ious ways depending on the programming language used.
We introduce the operator memoize to indicate that a rec-
ognizer has been memoized. This operator takes a string
which denotes the name of the recognizer, together with
the recognizer itself as arguments. The name is used for
memotable lookup and update. For example, consider the
following memoized recognizer:
msS = memoize "msS"
((ms then msS then msS)
orelse empty)
ms = memoize "ms" term_s
The operator memoize is defined as follows:
memoize name ri j
if lookup succeeds,
return memotable result
else
apply ri to j
update memotable with results
return results
The recognizer msS is the same as sS in all respects
except that it has cubic complexity (see later).
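One way to realize memoize in a pure language is to thread the memotable through every recognizer. The sketch below is a minimal illustration under several assumptions: the memotable is a Data.Map keyed on (name, index) rather than the nested structure described in the text, the input is a fixed global string, and thenR stands for then. It handles only non-left-recursive recognizers.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set
import Data.Map (Map)
import Data.Set (Set)

type Pos        = Int
type Memo       = Map (String, Pos) (Set Pos)
type Recognizer = Pos -> Memo -> (Set Pos, Memo)

input :: String
input = "ssss"

term :: Char -> Recognizer
term c j memo
  | j <= length input && input !! (j - 1) == c = (Set.singleton (j + 1), memo)
  | otherwise                                  = (Set.empty, memo)

empty :: Recognizer
empty j memo = (Set.singleton j, memo)

orelse :: Recognizer -> Recognizer -> Recognizer
(p `orelse` q) j memo =
  let (rp, memo1) = p j memo
      (rq, memo2) = q j memo1
  in (Set.union rp rq, memo2)

thenR :: Recognizer -> Recognizer -> Recognizer
(p `thenR` q) j memo =
  let (rp, memo1) = p j memo
      step (acc, m) k = let (rk, m') = q k m in (Set.union acc rk, m')
  in foldl step (Set.empty, memo1) (Set.toList rp)

-- Look the pair (name, j) up first; apply the recognizer only on a miss
-- and record its results before returning them.
memoize :: String -> Recognizer -> Recognizer
memoize name r j memo =
  case Map.lookup (name, j) memo of
    Just res -> (res, memo)
    Nothing  ->
      let (res, memo') = r j memo
      in (res, Map.insert (name, j) res memo')

msS, ms :: Recognizer
msS = memoize "msS" ((ms `thenR` msS `thenR` msS) `orelse` empty)
ms  = memoize "ms" (term 's')
```

Applying msS 1 Map.empty yields the result set {1, 2, 3, 4, 5} together with a memotable of ten entries, one for each of the two named recognizers at each of the indices 1 to 5.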
4 ACCOMMODATING LEFT RECURSION
In order to accommodate left recursive productions, we
simply use another table ctable during the memoization
process. The new table contains a set of values cij
denoting the number of times each recognizer ri has been
applied to an index j. For non-left-recursive recognizers
this count will be at most one, as the memotable lookup
will prevent such recognizers from ever being applied to
the same input twice. However, for left-recursive recogniz-
ers, the count is increased on recursive descent (owing to
the fact that the memotable is only updated on the recursive
ascent after the recognizer has been applied). Application
of a recognizer N to an input j is failed whenever the application
count already exceeds the length of the remaining
input plus 1. When this happens no parse is possible (other
than spurious parses which could occur with circular gram-
mars). As illustration, consider the following branch being
created during the parse of two remaining tokens on the
input:
          N
         / \
        N   A
       / \
      N   B
     / \
    P   C
   /
  Q
 /
N
The last call of N should be failed owing to the fact
that, irrespective of what A, B, and C are, either they must
require at least one input token, or else they must rewrite to
empty. If they all require a token, then the parse cannot
succeed. If any of them rewrite to empty, then the grammar
is circular (N is being rewritten to N) and the last call should
be failed.
Notice that simply failing a parse when a branch is
longer than the length of the input is incorrect as this can
occur in a correct parse if recognizers are rewritten into
other recognizers which do not have “token requirements
to the right”. For example, we cannot fail the parse at
P or Q as these could rewrite to empty without indicating
circularity.
To make use of the new table, we simply modify the
memoize operator to check and increment the cij counters
at appropriate points in the computation: if the memotable
lookup for the recognizer ri and the index j produces
a non-empty result, that result is returned with the two
tables unchanged. However, if the memotable does not
contain a result for that recognizer and that input, cij is
checked to see if the recognizer should be failed because
it has descended too far through left-recursion. If so,
memoize returns an empty set as result with the tables
unchanged. Otherwise, the counter cij is incremented and
the recognizer ri is applied to j, and the memotable is
updated with the result before it is returned.
memoize name ri j
if lookup succeeds,
return memotable results
else
if cij > (input length)-j+1
return {},
else
increment cij counter in ctable
apply ri to j
update’ memotable with results
return results
The memotable update function update’ is slightly dif-
ferent from the update function given in section 3. This
is owing to the fact that, on the recursive descent a recog-
nizer is only applied to an input if there is no entry in the
memotable. In the original memoization function, the re-
sults returned by the recognizer on ascent are simply added
to the memotable. However, if left-recursion occurs, the
memotable may have been updated by the same recognizer
for the same input lower in the parse tree. Therefore on as-
cent it is necessary to unite the results of each application
of the recognizer with memotable results which may have
been added by calls lower in the parse tree.
Using this approach, recognizers may now be defined
using explicit (as in mSL) or hidden (as in mZ) left recursion.
For example:
mSL = memoize "mSL"
((mSL then mSL then ms)
orelse empty)
ms = memoize "ms" term_s
mZ = memoize "mZ" (mz orelse mY)
mY = memoize "mY" (mZ then mSL)
mz = memoize "mz" term_z
The following are example applications of mZ, assuming
that the tables were initially empty. The second component
is the table showing for each recognizer the number of times
it visited each position in the input. The last component is
the table showing for each recognizer the positions at which
it was applied and the results of that application.
Grammar mZ ::= ’z’| mY
mY ::= mZ mSL
input = "zss"
mZ 1 ([],[]) => {2,3,4}
ctable =
{("mZ", {(1,3)}),
("mz", {(1,1)}),
("mY", {(1,4)}),
("mSL",{(2,2),(3,1),(4,0)}),
("ms", {(2,1),(3,1),(4,1)})}
memotable =
{("mz", {(1,{2})}),
("mY", {(1,{2,3,4})}),
("mZ", {(1,{2,3,4})}),
("mSL",{(2,{2,3,4}),(3,{3,4}),(4,{4})}),
("ms" ,{(2,{3}),(3,{4}),(4,{})})}))
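The modified memoize and the example recognizers above can be sketched in Haskell by threading both tables through each recognizer. This is an illustration under stated assumptions, not the reported implementation: the tables are Data.Map values keyed on (name, index), the record Tables and the helper run are hypothetical names, thenR stands for then, any stored memotable entry (even an empty set) is treated as a successful lookup, and the counter values produced by this sketch are not claimed to match the ctable listed above.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set
import Data.Map (Map)
import Data.Set (Set)

type Pos = Int
type Key = (String, Pos)               -- (recognizer name, start index)

-- ctable counts applications; memotable stores result sets.
data Tables = Tables
  { ctable    :: Map Key Int
  , memotable :: Map Key (Set Pos)
  }

type Recognizer = Pos -> Tables -> (Set Pos, Tables)

input :: String
input = "zss"

term :: Char -> Recognizer
term c j t
  | j <= length input && input !! (j - 1) == c = (Set.singleton (j + 1), t)
  | otherwise                                  = (Set.empty, t)

empty :: Recognizer
empty j t = (Set.singleton j, t)

orelse :: Recognizer -> Recognizer -> Recognizer
(p `orelse` q) j t0 =
  let (rp, t1) = p j t0
      (rq, t2) = q j t1
  in (Set.union rp rq, t2)

thenR :: Recognizer -> Recognizer -> Recognizer
(p `thenR` q) j t0 =
  let (rp, t1) = p j t0
      step (acc, t) k = let (rk, t') = q k t in (Set.union acc rk, t')
  in foldl step (Set.empty, t1) (Set.toList rp)

memoize :: String -> Recognizer -> Recognizer
memoize name r j t =
  case Map.lookup key (memotable t) of
    Just res -> (res, t)                               -- stored entry: reuse it
    Nothing
      | count > length input - j + 1 -> (Set.empty, t) -- curtail the descent
      | otherwise ->
          let t1        = t { ctable = Map.insert key (count + 1) (ctable t) }
              (res, t2) = r j t1
              -- update': unite with results stored lower in the parse tree
              old       = Map.findWithDefault Set.empty key (memotable t2)
              merged    = Set.union res old
          in (merged, t2 { memotable = Map.insert key merged (memotable t2) })
  where
    key   = (name, j)
    count = Map.findWithDefault 0 key (ctable t)

mSL, ms, mZ, mY, mz :: Recognizer
mSL = memoize "mSL" ((mSL `thenR` mSL `thenR` ms) `orelse` empty)
ms  = memoize "ms" (term 's')
mZ  = memoize "mZ" (mz `orelse` mY)
mY  = memoize "mY" (mZ `thenR` mSL)
mz  = memoize "mz" (term 'z')

-- Apply a recognizer at index 1 with both tables initially empty.
run :: Recognizer -> (Set Pos, Tables)
run r = r 1 (Tables Map.empty Map.empty)
```

With the input "zss", run mZ returns the result set {2, 3, 4}, and the memotable it builds holds the same result sets as the memotable listed above.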
5 INFORMAL DISCUSSION
OF TERMINATION
Basic recognizers such as term_s and the recognizer
empty clearly terminate for finite input. Other recogniz-
ers that are defined through mutual and nested recursion
are applied by the memoize function which takes a recognizer
and an index j as input and which accesses two
tables ctable and memotable. If a recognizer has an en-
try in memotable for the index j, it is not applied and
therefore we do not need to consider the size of the argu-
ments. If it does not have an entry in memotable, we must
consider two cases of possible recursion: 1) it is not a left-
recursive call and therefore at least one other recognizer
must have been applied before it which consumed at least
one token and increased the index by at least one before
the call, 2) it is a left-recursive call and the index argument
has not been changed. In this case, memoize increments
the left-recursion counter in ctable for that recognizer and
that index before the recursive call is made. Therefore an
appropriate measure function maps the index and ctable
values to a number which increases by at least one for each
recursive call. The fact that the number is bounded by con-
ditions imposed on the size of the index and on the sizes
of the left-recursion counters establishes termination.
6 COMPLEXITY
In the following complexity analysis, we assume that the
sets of results are represented as ordered lists, as are the
entries in the tables. We now show that memoized non-left-recursive
and left-recursive recognizers have worst-case
time complexities of O(n^3) and O(n^4) respectively,
where n = input#.
Assumption 2 — Elementary operations: We assume that
the following operations require a constant amount of time:
1. Testing if two values are equal, less than, etc.
2. Extracting the value of a tuple.
3. Adding an element to the front of a list.
4. Obtaining the value of the ith element of a list whose
length depends on R# but not on input#.
Assumption 3 — Merging of lists depends on their length.
Lemma 4 — Memotable lookup and update, checking and
incrementing left-recursion counters: From lemma 1 and
the definition of memoize, memotable has size O(n^2)
and ctable has size O(n). The function lookup is
O(n) requiring a search of memotable for the recognizer
name and then a search of the O(n) list of results (one for
each index). The function update is O(n) requiring the
same O(n) search as lookup plus a possible O(n) merge
of results. Checking for the value of a left-recursion counter
in ctable and increment of such a counter is clearly O(n).
Lemma 5 — Basic recognizers: Application of a basic recognizer
is at most O(n) requiring the use of an index j
into the input. Application of empty is O(1), simply
enclosing a single index in a list.
Lemma 6 — Alternation: Assuming that the recognizers rp
and rq have been applied to an index j and that the results
have already been computed, application of a memoized
recognizer (rp orelse rq) to j involves the following
steps:
1. one memotable lookup O(n)
2. and, if the recognizer has not been applied before:
a. one left-recursion counter check — O(n)
b. and, if the counter check permits:
merging of two result lists — O(n)
one memotable update O(n)
Lemma 7 — Sequencing: Assume that the recognizer rp
has been applied to an index j and that the results res
have been computed. In the worst case, res = [j,
j+1, j+2, .. n+1]. Assume also that, for all j’ ∈ res, rq
j’ has been computed. Then, application of a memoized
recognizer (rp then rq) to an index j involves:
1. one memotable lookup O(n)
2. and, if the recognizer has not been applied before:
a. one left-recursion counter check — O(n)
b. and, if the counter check permits:
application of rq to each index in res and
merging of the result lists — O(n^2).
one memotable update O(n)
Proof of O(n^3) complexity for non-left-recursive recognizers.
In the worst case, each recognizer ri ∈ R is applied to
each of the n indices at most once. The cost of an application
to one index is:
Case 1: For basic recognizers the cost is O(n) — Lemma 5.
Case 2: For recognizers of the form (rp orelse rq)
the cost is O(n) — Lemma 6.
Case 3: For recognizers of the form (rp then rq) the
cost is O(n^2) — Lemma 7.
In practice recognizers can be a combination of more than
two recognizers. However, from definition 2 the number of
component recognizers is finite and is independent of n.
It follows that the total cost is O(n^3).
Proof of O(n^4) complexity for left-recursive recognizers.
In the worst case, each recognizer ri ∈ R is applied to each
of the n indices at most n times before being curtailed. It
follows that the total cost is O(n^4).
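The counting argument behind both bounds can be written out as a worked sum. In this sketch |R| stands for the number of recognizers (written R# above), which is independent of n, the per-application cost is bounded by the sequencing case of Lemma 7, and c is a hypothetical index over the repeated applications of a left-recursive recognizer to one input index.

```latex
\begin{align*}
T_{\text{non-left-rec}}(n)
  &= \sum_{r_i \in R} \; \sum_{j=1}^{n} O(n^2)
   = |R| \cdot n \cdot O(n^2) = O(n^3), \\
T_{\text{left-rec}}(n)
  &= \sum_{r_i \in R} \; \sum_{j=1}^{n} \; \sum_{c=1}^{n} O(n^2)
   = |R| \cdot n \cdot n \cdot O(n^2) = O(n^4).
\end{align*}
```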
7 IMPLEMENTATION AND
EXPERIMENTAL RESULTS
The approach described in this paper has been implemented
using parser combinators in the pure functional programming
language Haskell using a method called “monadic
memoization” [3]. Details, including proofs of termination
and complexity, are available in Frost and Hafiz [2]. An
example recognizer was constructed corresponding to the
grammar sS ::= ’s’ sS sS | empty and applied to se-
quences of ’s’s of varying length. The results in the table
at the end of this paper were obtained using the Haskell
interpreter Hugs 98 on a PC with 0.5 GB of RAM. The
results appear to support our claim that we have avoided
exponential behaviour.
It should be noted that the example grammar that we
have used so far is highly ambiguous, far exceeding any am-
biguity found in natural language. Also, many constructs in
natural language are defined without use of left recursion.
Consequently, we have also investigated the performance
of our approach with respect to a small natural-language
grammar. The following is the definition of the recognizer
in Haskell. This example illustrates the close correspon-
dence between the grammar and the program code when
parser combinators are used (obviating the need to give
the grammar separately in this example). The recognizer
sent recognizes sentences in a very small subset of Eng-
lish. tp stands for termphrase, det for determiner, and vp
for verbphrase:
sent = memoize "sent"
(tp ‘then2‘ vp ‘then2‘ tp)
tp = memoize "tp"
(simple_tp
‘orelse2‘ (tp ‘then2‘ join ‘then2‘ tp))
join = memoize "join"
(term2 "and" ‘orelse2‘ term2 "or")
simple_tp   = memoize "simple_tp"
                (proper_noun
                 `orelse2` det_phrase)
proper_noun = memoize "proper_noun"
                (term2 "helen"
                 `orelse2` term2 "john"
                 `orelse2` term2 "pat")
det_phrase  = memoize "det_phrase"
                (det `then2` noun)
det         = memoize "det"
                (term2 "every"
                 `orelse2` term2 "some")
noun        = memoize "noun"
                (term2 "boy"
                 `orelse2` term2 "girl"
                 `orelse2` term2 "man"
                 `orelse2` term2 "woman")
vp          = memoize "vp"
                (verb
                 `orelse2` (vp `then2` join `then2` vp))
verb        = memoize "verb"
                (term2 "knows"
                 `orelse2` term2 "respects"
                 `orelse2` term2 "loves")
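For readers who want to experiment with the non-left-recursive part of this grammar, the following sketch gives simplified stand-ins for the combinators. The names term, orelse and thenR are hypothetical and are not the term2, orelse2 and then2 used above: they carry no memotable and do nothing to curtail left recursion, so they must not be applied to tp or vp as defined above. A recognizer maps a 1-based start index to the list of indices at which it can end.

```haskell
import Data.List (nub)

-- A recognizer takes the token list and a 1-based start index and
-- returns the indices at which recognition can end.
type Recognizer = [String] -> Int -> [Int]

-- Hypothetical unmemoized stand-in for term2: match one token.
term :: String -> Recognizer
term w toks j
  | j >= 1 && j <= length toks && toks !! (j - 1) == w = [j + 1]
  | otherwise = []

-- Hypothetical stand-in for orelse2: union of both alternatives.
orelse :: Recognizer -> Recognizer -> Recognizer
orelse p q toks j = nub (p toks j ++ q toks j)

-- Hypothetical stand-in for then2: run q from every end index of p.
thenR :: Recognizer -> Recognizer -> Recognizer
thenR p q toks j = nub (concatMap (q toks) (p toks j))

det, noun, detPhrase, properNoun, simpleTp :: Recognizer
det        = term "every" `orelse` term "some"
noun       = term "boy" `orelse` term "girl"
             `orelse` term "man" `orelse` term "woman"
detPhrase  = det `thenR` noun
properNoun = term "helen" `orelse` term "john" `orelse` term "pat"
simpleTp   = properNoun `orelse` detPhrase
```

Applied to the tokens "every boy or some girl", detPhrase succeeds at start indices 1 and 4 and fails elsewhere.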
Application of tp to the ambiguous termphrase:
["every","boy","or","some","girl",
 "and","helen","and","john","or","pat"]
requires 57,131 reductions, 102,477 cells, and returns the
following result:
([3,6,8,10,12],
([("tp",[(1,11),(4,8),(7,5),(9,3),(11,1)]),
("simple_tp" ,[(1,1),(4,1),(7,1),etc.]),
("proper_noun",[(1,1),(4,1),(7,1),etc.]),
etc.
[("proper_noun",[(1,[]),(4,[]),(7,[8]),
(9,[10]),(11,[12])]),
("det", [(1,[2]),(4,[5]),(7,[]),
(9,[]), (11,[])]),
("noun", [(2,[3]),(5,[6])]),
("det_phrase", [(1,[3]),(4,[6]),(7,[]),
(9,[]),(11,[])]),
("simple_tp" ,[(1,[3]),(4,[6]),(7,[8]),
(9,[10]),(11,[12])]),
("tp", [(1,[3,6,8,10,12]),
(4,[6,8,10,12]),
(7,[8,10,12]),
(9,[10,12]),(11,[12])]),
("join", [(3,[4]),(6,[7]),(8,[9]),
(10,[11]),(12,[])])]))
Application of sent to the list of tokens corresponding
to the highly ambiguous sentence “every boy or some girl
and helen and john or pat knows and respects or loves every
boy or some girl and pat or john and helen” took 408,454
reductions, used 691,504 cells, and returned results in
approximately 0.5 seconds.
The prototype processor is clearly not fast. However,
the combinators were not optimized, and Hugs 98 is an
interpreted version of Haskell. Our approach to top-down
parsing could be implemented in a more efficient program-
ming environment.
8 COMPACT REPRESENTATION
OF PARSE TREES
Reference to the example application given above shows
that most of the information for reconstructing the parse
trees is already available in the memotable. For example,
the memotable output shows that the input contains three
proper nouns at positions 7, 9 and 11, etc. Additional infor-
mation could be collected during the recognition process by
naming and memoizing the alternative recognizers on the
right-hand sides of grammar productions. This, together
with the grammar, would provide all of the information
necessary to extract the potentially exponential number of
parses from the memotable. The memotable is bounded by
the number of recognizers, the number of indices, and the
sizes of the result sets, the latter two of which depend
on the length of the input. Consequently, the memotable
has worst-case size O(n^2) and provides a compact
representation of the possibly exponential number of parse trees,
which appears to be similar to that proposed by Tomita [15].
The major advantage of creating a compact representation
of parse trees is that syntactic agreement rules, together with
semantic rules, can be used to prune out sub-trees which
are shared by many possible parses.
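As a small illustration of reading information back out of the memotable, the sketch below encodes a few entries from the example output in Section 7 and queries them. The type MemoTable and the functions successes and spans are hypothetical names introduced here; the data are copied from the output shown earlier.

```haskell
import Data.Maybe (fromMaybe)

-- Recognizer name -> list of (start index, end indices).
type MemoTable = [(String, [(Int, [Int])])]

-- A few entries copied from the memotable in the example output.
memoTable :: MemoTable
memoTable =
  [ ("proper_noun", [(1, []), (4, []), (7, [8]), (9, [10]), (11, [12])])
  , ("det",         [(1, [2]), (4, [5]), (7, []), (9, []), (11, [])])
  , ("noun",        [(2, [3]), (5, [6])])
  , ("det_phrase",  [(1, [3]), (4, [6]), (7, []), (9, []), (11, [])])
  ]

-- Start indices at which the named recognizer succeeded.
successes :: String -> MemoTable -> [Int]
successes name table =
  [ j | (j, ends) <- fromMaybe [] (lookup name table), not (null ends) ]

-- Every (start, end) span recorded for the named recognizer.
spans :: String -> MemoTable -> [(Int, Int)]
spans name table =
  [ (j, e) | (j, ends) <- fromMaybe [] (lookup name table), e <- ends ]
```

Here successes "proper_noun" memoTable yields the positions 7, 9 and 11 mentioned above. Reconstructing complete parse trees would additionally need the per-alternative information described in this section.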
9 CONCLUDING COMMENTS
Future work includes:
1. Extending the approach to parsers and evaluators.
2. Optimizing the combinators used in the Haskell
implementation, using the techniques of Koopman and
Plasmeijer [7].
3. Testing the approach on large natural-language gram-
mars.
10 ACKNOWLEDGEMENTS
Richard Frost acknowledges the support of NSERC, the Natural
Sciences and Engineering Research Council of Canada.
ACM SIGPLAN Notices 53 Vol. 41 (5), May 2006
length of  number of leftmost parses   number of reductions
input n    with                        mS  ::= 's' mS mS | empty
           S ::= 's' S S | empty,      mSL ::= mSL mSL 's' | empty
           C(2n, n) / (n + 1)
                                       mS without        mS with        mSL
                                       memoization       memoization
                                       (checks all
                                       partial parses)
  3        5                           2,781             2,834          5,990
  6        132                         65,049            7,081          28,366
 12        20,812                      out of space      23,297         206,903
 24        128,990,414,734                               99,469         2,005,561
 48        1.313278982422e+26                            424,929        17,125,991
 96        huge                                          2,620,807      out of space
192                                                      18,119,356
384                                                      134,091,390

Note (second column): the number of partial parses consistent with s^n
is larger than this.
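One way to read the table is to estimate the growth exponent between successive rows. The sketch below assumes that the number of reductions behaves roughly like c * n^k and computes k from each pair of adjacent rows; the names mSMemo, mSLRows and exponents are hypothetical, and the figures are copied from the two memoized columns of the table. Hugs reductions are only a proxy for running time, so this is an indication rather than a measurement of complexity.

```haskell
-- (input length, number of reductions), copied from the table.
mSMemo :: [(Double, Double)]
mSMemo =
  [ (3, 2834), (6, 7081), (12, 23297), (24, 99469)
  , (48, 424929), (96, 2620807), (192, 18119356), (384, 134091390) ]

mSLRows :: [(Double, Double)]
mSLRows =
  [ (3, 5990), (6, 28366), (12, 206903), (24, 2005561), (48, 17125991) ]

-- Empirical exponent k between adjacent rows, assuming cost ~ c * n^k.
exponents :: [(Double, Double)] -> [Double]
exponents rows = zipWith slope rows (drop 1 rows)
  where
    slope (n1, r1) (n2, r2) = logBase (n2 / n1) (r2 / r1)
```

For these figures every estimated exponent for mS with memoization stays below 3, and every one for mSL stays below 4, which is in line with the O(n^3) and O(n^4) bounds.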
11 REFERENCES
1. Camarao, C., Figueiredo, L. and Oliveira, R. H. (2003) Mimico:
A Monadic Combinator Compiler Generator. Journal of
the Brazilian Computer Society 9 (1).
2. Frost, R. A. and Hafiz, R. (2006) Using monads to accommodate
ambiguity and left recursion with parser combinators.
Technical Report 06–007, School of Computer Science,
University of Windsor, Canada.
3. Frost, R. A. (2003) Monadic memoization towards
Correctness-Preserving Reduction of Search. AI 2003, eds.
Y. Xiang and B. Chaib-draa. LNAI 2671, 66–80.
4. Frost, R. A. (1993) Guarded attribute grammars. Software
Practice and Experience 23 (10) 1139–1156.
5. Hutton, G. (1992) Higher-order functions for parsing. J.
Functional Programming 2 (3) 323–343.
6. Johnson, M. (1995) Squibs and Discussions: Memoization in
top-down parsing. Computational Linguistics 21 (3) 405–417.
7. Koopman, P. and Plasmeijer, R. (1999) Efficient combinator
parsers. In Implementation of Functional Languages, LNCS
1595, 122–138. Springer-Verlag.
8. Koskimies, K. (1990) Lazy recursive descent parsing for
modular language implementation. Software Practice and
Experience 20 (8) 749–772.
9. Kuno, S. (1965) The predictive analyzer and a path elimination
technique. Communications of the ACM 8 (7) 453–462.
10. Leermakers, R. (1993) The Functional Treatment of Parsing.
Kluwer Academic Publishers, ISBN 0–7923–9376–7.
11. Lickman, P. (1995) Parsing With Fixed Points. Master’s
Thesis, University of Cambridge.
12. Nederhof, M. J. and Koster, C. H. A. (1993) Top-Down Parsing
for Left-recursive Grammars. Technical Report 93–10,
Research Institute for Declarative Systems, Department of
Informatics, Faculty of Mathematics and Informatics, Katholieke
Universiteit, Nijmegen.
13. Norvig, P. (1991) Techniques for automatic memoisation with
applications to context-free parsing. Computational
Linguistics 17 (1) 91–98.
14. Sheil, B. A. (1976) Observations on context-free parsing.
Technical Report TR 12–76, Center for Research in Computing
Technology, Aiken Computational Laboratory, Harvard
University.
15. Tomita, M. (1985) Efficient Parsing for Natural Language.
Kluwer, Boston, MA.
16. Wadler, P. (1985) How to replace failure by a list of successes,
in P. Jouannaud (ed.) Functional Programming Languages
and Computer Architectures, Lecture Notes in Computer
Science 201, Springer-Verlag, Heidelberg, 113.