Faster Scannerless GLR Parsing
Giorgios Economopoulos, Paul Klint, and Jurgen Vinju
Centrum voor Wiskunde en Informatica (CWI), Kruislaan 413, 1098 SJ Amsterdam,
The Netherlands
Abstract. Analysis and renovation of large software portfolios requires
syntax analysis of multiple, usually embedded, languages and this is
beyond the capabilities of many standard parsing techniques. The
traditional separation between lexer and parser falls short due to the lim-
itations of tokenization based on regular expressions when handling mul-
tiple lexical grammars. In such cases scannerless parsing provides a viable
solution. It uses the power of context-free grammars to be able to deal
with a wide variety of issues in parsing lexical syntax. However, it comes
at the price of less efficiency. The structure of tokens is obtained using a
more powerful but more time and memory intensive parsing algorithm.
Scannerless grammars are also more non-deterministic than their tok-
enized counterparts, increasing the burden on the parsing algorithm even further.
In this paper we investigate the application of the Right-Nulled Gener-
alized LR parsing algorithm (RNGLR) to scannerless parsing. We adapt
the Scannerless Generalized LR parsing and filtering algorithm (SGLR)
to implement the optimizations of RNGLR. We present an updated pars-
ing and filtering algorithm, called SRNGLR, and analyze its performance
in comparison to SGLR on ambiguous grammars for the programming
languages C, Java, Python, SASL, and C++. Measurements show that
SRNGLR is on average 33% faster than SGLR, but is 95% faster on
the highly ambiguous SASL grammar. For the mainstream languages C,
C++, Java and Python the average speedup is 16%.
1 Introduction
For the precise analysis and transformation of source code we first need to parse
the source code and construct a syntax tree. Application areas like reverse en-
gineering, web engineering and model driven engineering specifically deal with
many different languages, dialects and embeddings of languages into other lan-
guages. We are interested in the construction of parsing technology that can
service such diversity; to allow a language engineer to experiment with and effi-
ciently implement parsers for real and complex language constellations.
A parser is a tool, defined for a specific grammar, that constructs a syntac-
tic representation (usually in the form of a parse tree) of an input string and
determines if the string is syntactically correct or not. Parsing often includes a
scanning phase which first splits the input string into a list of words or tokens.
This list is then further analyzed using a more powerful parsing algorithm. This
O. de Moor and M. Schwartzbach (Eds.): CC 2009, LNCS 5501, pp. 126–141, 2009.
Springer-Verlag Berlin Heidelberg 2009
scanning/parsing dichotomy is not always appropriate, especially when parsing
legacy languages or embedded languages. Scanners are often too simplistic to
be able to deal with the actual syntax of a language and they prohibit modu-
lar implementation of parsers. Scannerless parsing [16,17,25] is a technique that
avoids such issues that would be introduced by having a separate scanner [5].
Intuitively, a scannerless parser uses the power of context-free grammars instead
of regular expressions to tokenize an input string.
The following Fortran statement is a notorious example of scanning issues [1]:
DO5I=1.25. It is not until the decimal point that it becomes clear that
we are dealing here with an assignment to the variable DO5I.¹ However, in the
slightly different statement: DO 5 I = 1,25, DO is a keyword and the statement
as a whole is a loop construct. This example highlights that tokenization using
regular expressions, without a parsing context, can easily be non-deterministic
and even ambiguous. In order to restrict the number of possibilities, scanners
usually apply several implicit rules like, e.g., Prefer Longest Match, Prefer
Keywords, and Prefer First Applicable Rule. The downside of such disambiguation is that
the scanner commits itself to one choice of tokens and blocks other interpreta-
tions of the input by the parser. A scannerless parser with enough lookahead
does not have this problem.
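The Fortran example can be made concrete with a small, hypothetical longest-match tokenizer sketch (the token definitions are illustrative, not a real Fortran lexer). Because Fortran ignores spaces, the scanner sees the run of characters DO5I in both statements and commits to an identifier before the decisive `.` or `,` is reached:

```python
import re

# A hypothetical longest-match tokenizer (illustrative token set, not a
# real Fortran lexer). Fortran ignores spaces, so the scanner sees the
# raw character run "DO5I=..." in both statements.
TOKENS = [
    ("KEYWORD", r"DO"),
    ("IDENT",   r"[A-Z][A-Z0-9]*"),
    ("NUMBER",  r"[0-9]+(\.[0-9]+)?"),
    ("ASSIGN",  r"="),
    ("COMMA",   r","),
]

def scan(text):
    """Greedy scanner: at each position emit the longest match,
    committing immediately (no parsing context, no backtracking)."""
    pos, out = 0, []
    while pos < len(text):
        matches = [(name, m.group()) for name, pat in TOKENS
                   if (m := re.match(pat, text[pos:]))]
        if not matches:
            raise ValueError(f"no token at position {pos}")
        tok = max(matches, key=lambda t: len(t[1]))
        out.append(tok)
        pos += len(tok[1])
    return out

# Assignment to the variable DO5I -- the scan happens to be right:
print(scan("DO5I=1.25"))
# Loop header 'DO 5 I = 1,25' -- the scanner still commits to the
# identifier DO5I, blocking the correct 'DO' keyword interpretation:
print(scan("DO5I=1,25"))
```

In both cases the scanner emits `("IDENT", "DO5I")` first; only a parser with enough context (or a scannerless parser) can choose the keyword reading in the second statement.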
Another example is the embedding of Java code in AspectJ definitions and vice
versa. If a scanner is needed for the combination of the two languages, you may
end up reserving the new AspectJ keywords in the Java code as well. However,
existing Java code may easily contain such identifiers, resulting in parsing errors
for code that was initially parsed correctly. One approach that could avoid this
problem would be to use two separate scanners: one that is active while parsing
pure AspectJ code and another that is active while parsing pure Java code. Once
again, the parsing context would be used to decide which scanner is used in the
tokenization. This problem does not exist when using a scannerless parser [6].
In a classical scanner/parser approach the scanner makes many decisions re-
garding tokenization. In a scannerless parser these decisions are postponed and
have to be made by the parser. Consequently, scannerless parsers generally have
to deal with more non-determinism than before, so the deterministic LR parsing
algorithms can no longer be used. However, it turns out that the non-determinism
introduced by the removal of the scanner can be gracefully handled by General-
ized LR (GLR) parsing algorithms [20,14,15].
The advantages of a one phase scannerless parser over a traditional two phase
scanner and parser may not be immediately obvious. The following list highlights
the main benefits of scannerless parsing:
– Computational power: lexical ambiguity is a non-issue and full definition of
  lexical syntax for real languages is possible.
– Modularity: languages with incompatible lexical syntaxes can be combined.
– Scope: the ability to generate parsers for more languages, including ambiguous,
  embedded and legacy languages.
¹ Recall that Fortran treats spaces as insignificant, also inside identifiers.
– Simplicity: no hard-wired communication between scanning and parsing.
– Declarativeness: no side-effects and no implicit lexical disambiguation rules.
So, on the one hand a language engineer can more easily experiment with
and implement more complex and more diverse languages using a parser gen-
erator that is based on Scannerless GLR parsing. On the other hand there is a
cost. Although it does not have a scanning phase, scannerless parsing is a lot
more expensive than its two-staged counterpart. The structure of tokens is now
retrieved with a more time and memory intensive parsing algorithm. A collec-
tion of grammar rules that recognizes one token type, like an identifier, could
easily have 6 rules, including recursive ones. Parsing one character could there-
fore involve several GLR stack operations, searching for applicable reductions
and executing reductions. Given an average token length of 8 characters and
an average of 4 stack operations per character, a scannerless parser
would do 4 × 8 = 32 times more work per token than a parser that reads a pre-
tokenized string. Furthermore, a scannerless parser has to consider all whitespace
and comment tokens. An average program consists of more than 50% whitespace
which again multiplies the work by two, raising the difference between the
two methods to a factor of 64. Moreover, scannerless grammars are more non-
deterministic than their tokenized counterparts, increasing the burden on the
parsing algorithm even more.
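The back-of-the-envelope factor above can be written out explicitly (the averages are the text's illustrative assumptions, not measurements):

```python
# Rough cost factor of scannerless vs. tokenized parsing, per the text:
ops_per_char = 4       # GSS stack operations per character
token_length = 8       # average token length in characters
layout_factor = 2      # >50% of the input is whitespace/comments
per_token = ops_per_char * token_length
print(per_token)                     # -> 32
print(per_token * layout_factor)     # -> 64
```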
Fortunately, it has been shown [5] that scannerless parsers are fast enough to
be applied to real programming languages. In this paper we investigate the im-
plementation of the Scannerless GLR (SGLR) parser provided with SDF [25,5].
It makes scannerless parsing feasible by rigorously limiting the non-determinism
that is introduced by scannerless parsing using disambiguation filters. It is and
has been used to parse many different kinds of legacy programming languages
and their dialects, experimental domain specific languages and all kinds of em-
beddings of languages into other languages. The parse trees that SGLR pro-
duces are used by a variety of tools including compilers, static checkers, architec-
ture reconstruction tools, source-to-source transformers, refactoring, and editors
in IDEs.
As SDF is applied to more and more diverse languages, such as scripting and
embedded web scripting languages, and in an increasing number of contexts such
as in plugins for the Eclipse IDE, the cost of scannerless parsing has become more
of a burden. That is our motivation to investigate algorithmic changes to SGLR
that would improve its efficiency. Note that the efficiency of SGLR is defined by
the efficiency of the intertwined parsing and filtering algorithms.
We have succeeded in replacing the embedded parsing algorithm in SGLR—
based on Farshi’s version of GLR [14]—with the faster Right-Nulled GLR al-
gorithm [18,9]. RNGLR is a recent derivative of Tomita’s GLR algorithm that,
intuitively, limits the cost of non-determinism in GLR parsers. We therefore in-
vestigated how much the RNGLR algorithm would mitigate the cost of scanner-
less parsing, which introduces more non-determinism. The previously published
results on RNGLR can not be extrapolated directly to SGLR because of (A) the
missing scanner, which may change trade-offs between stack traversal and stack
construction and (B) the fact that SGLR is not a parsing algorithm per se, but
rather a parsing and filtering algorithm. The benefit of RNGLR may easily be
insignificant compared to the overhead of scannerless parsing and the additional
costs of filtering.
In this paper we show that a Scannerless Right-Nulled GLR parser and filter
is actually significantly faster on real applications than traditional SGLR. The
amalgamated algorithm, called SRNGLR, requires adaptations in parse table
generation, parsing and filtering, and post-parse filtering stages of SGLR. In
Section 2 we analyze and compare the run-time efficiency of SGLR and the new
SRNGLR algorithm. In Sections 3 and 4 we explain what the differences between
SGLR and SRNGLR are. We conclude the paper with a discussion in Section 6.
2 Benchmarking SGLR and SRNGLR
In Sections 3 and 4 we will delve into the technical details of our parsing al-
gorithms. Before doing so, we first present our experimental results. We have
compared the SGLR and SRNGLR algorithms using grammars for an extended
version of ANSI-C—dubbed C’—, C++, Java, Python, SASL and Γ1—a small
grammar that triggers interesting behaviour in both algorithms. Table 1 de-
scribes the grammars and input strings used. Table 2 provides statistics on the
sizes of the grammars. We conducted the experiments on a 2.13GHz Intel Dual
Core with 2GB of memory, running Linux 2.6.20.
SGLR and SRNGLR are comprised of three different stages: parse table
generation, parsing and post-parse filtering. We focus on the efficiency of the latter
two, since parse table generation is a one-time cost. We are not interested in
the runtime of recognition without tree construction. Note that between the
two algorithms the parsing as well as the filtering changes and that these influ-
ence each other. Filters may prevent the need to parse more and changes in the
parsing algorithm may change the order and shape of the (intermediate) parse
forests that are filtered. Efficiency measurements are also heavily influenced by
the shapes of the grammars used as we will see later on.
The SRNGLR version of the parser was first tested to verify that it outputs the same parse
forests that SGLR does, modulo the order of trees in ambiguity clusters.
Table 3 and Figure 1 show the arithmetic mean time of five runs and Table 4
provides statistics on the amount of work that is done. GLR parsers use a Graph
Structured Stack (GSS). The edges of this graph are visited to find reductions
and new nodes and edges are created when parts of the graph can be reduced
or the next input character can be shifted. Each reduction also leads to the
construction of a new parse tree node and sometimes a new ambiguity cluster. An
ambiguity cluster encapsulates different ambiguous trees for the same substring.
For both algorithms we count the number of GSS edge visits, GSS node creations,
edge and node visits for garbage collection, and parse tree node and ambiguity
cluster visits for post-parse filtering. Note that garbage collection of the GSS is
an important factor in the memory and run-time efficiency of GLR.
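To make the measured quantities concrete, here is a minimal GSS sketch (illustrative encoding, not the authors' implementation). Nodes are identified by an LR state and an input position, edges carry the parse tree for the spanned substring, and the counters correspond to the NC and EC columns of Table 4:

```python
from dataclasses import dataclass

# A minimal Graph Structured Stack sketch (illustrative only).
@dataclass(frozen=True)
class Node:
    state: int
    position: int

class GSS:
    def __init__(self):
        self.edges = {}                        # Node -> {Node: tree}
        self.stats = {"nodes": 0, "edges": 0}

    def add_node(self, state, position):
        node = Node(state, position)
        if node not in self.edges:             # stacks converge on one node
            self.edges[node] = {}
            self.stats["nodes"] += 1
        return node

    def add_edge(self, child, parent, tree):
        # An already-existing edge signals a local ambiguity; GLR packs
        # it instead of duplicating the stack.
        if parent not in self.edges[child]:
            self.edges[child][parent] = tree
            self.stats["edges"] += 1

gss = GSS()
bottom = gss.add_node(0, 0)            # initial state, position 0
top = gss.add_node(3, 1)               # after shifting one character
gss.add_edge(top, bottom, ("shift", "a"))
gss.add_node(3, 1)                     # re-adding merges, no new node
print(gss.stats)                       # -> {'nodes': 2, 'edges': 1}
```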
Table 1. Grammars and input strings used

Name    Grammar description                                 Input size  Input description
C’      ANSI-C plus ambiguous exception handling extension  32M/1M      Code for an embedded system
C++     Approaches ISO standard, with GNU extensions        2.6M/111K   Small class that includes much of the STL
Java    Grammar from [6] that implements Java 5.0           0.5M/18k    Implementation of The Meta-Environment [3]
Python  Derived from the reference manual [24], ambiguous   7k/201      From the Python distribution
        due to missing off-side rule² implementation
SASL    Taken from [22], ambiguous due to missing           2.5k+/114+  Standard prelude, concatenated to increasing sizes
        off-side rule implementation
Γ1      S ::= SSS | SS | a; triggers worst-case             1–50/1      Strings of a’s of increasing length
        behaviour [9]
Table 2. Grammar statistics showing nullable non-terminals (NNT), nullable productions (NP), right-nullable productions (RNP), SLR(1) states, shifts and gotos, reductions and reductions with dynamic lookahead restriction (LA Reductions)

        NNT  NP   RNP  States  Shifts+Gotos  Reductions  LA Reductions
C’      71   93   94   182k    37k           18k  23k    5.9k  6.3k
C++     90   112  102  112k    18k           19k  19k    1.5k  1.5k
Java    81   112  116  67k     9.7k          5.9k 6.0k   1.0k  1.1k
Python  56   74   85   22k     3.4k          1.7k 1.9k   0     0
SASL    16   21   22   4.5k    0.9k          0.5k 0.6k   0     0
Γ1      0    0    0    13      30            13   15     0     0
For this benchmark, SRNGLR is on average 33% faster than SGLR with a
smallest speedup of 9.8% for C and a largest speedup of 95% for SASL. Appar-
ently the speedup is highly dependent on the specific grammar. If we disregard
SASL the improvement is still 20% on average and if we also disregard Γ50
the average drops to a still respectable 16% improvement for the mainstream
languages C, C++, Java and Python. The results show that SRNGLR parsing
speed is higher (up to 95%) for grammars that are highly ambiguous such as
SASL. SRNGLR also performs better on less ambiguous grammars such as Java
(14% faster). The parsing time is always faster, and in most cases the filtering
time is also slightly faster for SRNGLR but not significantly so.
The edge visit statistics (Table 4 and Figure 3) explain the cause of the
improved parsing time. Especially for ambiguous grammars the SGLR algorithm
² The off-side rule was not implemented because SDF does not have a declarative
disambiguation construct that can express its semantics. It can be implemented
in ASF as a post-parse traversal, but this has no effect on the timings described here.
Table 3. Speed (characters/second), Parse time (seconds), Filter time (seconds), Total time (seconds) and Speedup (%) of SGLR (S) and SRNGLR (SRN). k=10³

                     C’            C++          Java        Python        SASL80        Γ1
                     S      SRN    S     SRN    S     SRN   S      SRN    S      SRN    S     SRN
Speed (chars/sec.)   385k   443k   121k  175k   404k  467k  178    904    78     1k     4.7   24
Parse time (sec.)    84.2   73.2   21.5  14.9   2.1   1.8   39.2   7.7    4.8k   202.2  10.8  2.1
Filter time (sec.)   102.9  95.5   5.7   5.6    0.8   0.7   327.3  298.8  1.6    1.6    7.7   9.5
Total time (sec.)    187.2  168.8  27.3  20.6   2.9   2.5   366.5  306.5  4.8k   203.9  18.5  11.6
Speedup (%)          9.8           24.5         13.8        16.4          95            37.6
Table 4. Workload data. Edges traversed searching reductions (ET), edges traversed
searching existing edge (ES), GSS nodes created (NC), GSS edges created (EC), edges
traversed for garbage collection (GC), ambiguity nodes created while filtering (FAC),
and parse tree nodes created while filtering (FNC). k=10³, M=10⁶, B=10⁹

      C’           C++          Java         Python       SASL80      Γ50
      S     SRN    S     SRN    S     SRN    S     SRN    S    SRN    S     SRN
ET    149M  44M    26M   6.6M   3.2M  0.9M   90M   3.4M   71B  165M   48M   0.7M
ES    81M   18M    145M  27M    5.0M  0.9M   1.8B  234M   16B  14B    28M   14M
NC    141M  143M   19M   20M    3.0M  3.0M   157k  157k   2.4M 2.4M   252   252
EC    154M  157M   30M   31M    3.5M  3.4M   962k  962k   44M  44M    3.9k  3.9k
GC    13M   13M    6.2M  6.8M   0.7M  0.6M   2.0M  2.0M   88M  88B    14k   14k
FAC   30k   30k    5.6k  5.6k   0     0      83k   83k    48k  48k    1.2k  2.1k
FNC   241M  241M   13M   13M    1.6M  1.6M   707M  707M   3.1M 3.1M   1.1M  1.3M
traverses many more GSS edges. According to the time measurements this is
significant for real world applications of scannerless parsing.
Filtering time is improved in all but the Γ1 case, although the improvement
is not greater than 10%. The workload statistics show that about the same
number of nodes are created during filtering. The differences are lost in the
rounding of the numbers, except for the Γ1 case which shows significantly more
node creation at filtering time. This difference is caused by different amounts of
sharing of ambiguity clusters between the two versions. The amount of sharing
in ambiguity clusters during parsing, for both versions, depends on the arbitrary
ordering of reduction steps and is therefore not relevant for our analysis.
Notice that the parse time versus filtering time ratio can be quite different
between languages. This highly depends on the shape of the grammar. LR fac-
tored grammars have higher filtering times due to the many additional parse
tree nodes for chain rules. The Python grammar is an example of such a gram-
mar, while SASL was not factored and has a minimum number of non-terminals
for its expression sub-language. Shorter grammars with less non-terminals have
better filtering speed. We expect that by “unfactoring” the Python grammar a
lot of speed may be gained.
Figure 2 depicts how SRNGLR improves parsing speed as the input length
grows. For Γ1 it is obvious that the gain is higher when the input gets larger.
Fig. 1. Runtime comparison between SGLR (first col.) and SRNGLR (second col.).
The other bar accounts for the time taken to read and process the input string and
parse table.
Fig. 2. Comparison of SGLR and
SRNGLR parsing time for Γ1
Fig. 3. Correlation between saving of
edge traversals and parsing speedup
Note that although Γ1 does not have any right-nullable productions (see Table
2) there is still a significant gain. The reason for this is that SRNGLR avoids
certain work for all grammars (see Section 3).
From these results we may conclude that SRNGLR clearly introduces a struc-
tural improvement that increases the applicability of scannerless GLR parsing to
large programs written in highly ambiguous scripting languages such as Python
and SASL. Also, we may conclude that it introduces a significant improvement
for less ambiguous or non-ambiguous languages and that the shape of a grammar
highly influences the filtering speed.
3 The RNGLR and SGLR Algorithms
In this section we outline the RNGLR and SGLR algorithms and highlight the
main differences between them. There are four main differences between the
SGLR and RNGLR algorithms:
– Different parse table formats are used: SLR(1) [25] versus RN [9].
– SGLR does more traversals of the GSS during parsing than RNGLR.
– Different parse forest representations are used: maximally shared trees [23]
  versus SPPFs [15].
– SGLR implements disambiguation filters [5] whereas RNGLR does not.
The RNGLR algorithm combines adaptations in the parse table generation al-
gorithm with simplifications in the parser run-time algorithm. It is based on
Tomita’s algorithm, called Generalized LR (GLR) [20]. GLR extends the LR
parsing algorithm to work on all context-free grammars by replacing the stack of
the LR parsing algorithm with a Graph Structured Stack (GSS). Using the GSS
to explore different derivations in parallel, GLR can parse sentences for gram-
mars with parse tables that contain LR conflicts rather efficiently. However, the
GLR algorithm fails to terminate on certain grammars. Farshi’s algorithm fixes
the issue in an inefficient manner, by introducing extra searching of the GSS
[14]. This algorithm is the basis for SGLR. The RNGLR algorithm fixes the same
issue in a more efficient manner.
RNGLR introduces a modified LR parse table: an RN table. RN tables are
constructed in a similar way to canonical LR tables, but in addition to the
standard reductions, reductions on right nullable rules are also included. A right
nullable rule is a production rule of the form A ::= αβ where β ⇒* ε.³ By
reducing the left part of the right nullable rule (α) early, the RNGLR algorithm
avoids the problem that Tomita’s algorithms suffered from and hence does not
require Farshi’s expensive correction. However, since the right nullable symbols
of the rule (β) have not been reduced yet it is necessary to pre-construct the
parse trees of those symbols. These nullable trees are called ε-trees and since
they are constant for a given grammar, they can be constructed at parse table
generation time and included in the RN parse table. The early RN reduction
will construct a full derivation simply by including the pre-constructed trees.
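The grammar properties involved can be illustrated with a small sketch that computes nullable non-terminals and right-nullable productions, the sets an RN table generator needs (cf. the NNT and RNP columns of Table 2). The toy grammar is an assumption for illustration, not one of the benchmarked grammars or the actual generator:

```python
# Uppercase symbols are non-terminals; each alternative is a symbol list.
GRAMMAR = {
    "S": [["A", "B", "a"], ["B"]],
    "A": [["a"], []],        # A ::= a | (empty)  -- A is nullable
    "B": [["A", "A"]],       # B is nullable because A is
}

def nullable_nonterminals(grammar):
    """Fixed-point iteration: a non-terminal is nullable when some
    alternative consists entirely of nullable symbols."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for lhs, alts in grammar.items():
            if lhs not in nullable and any(
                    all(s in nullable for s in alt) for alt in alts):
                nullable.add(lhs)
                changed = True
    return nullable

def right_nullable_productions(grammar, nullable):
    """A ::= alpha beta is right-nullable when a non-empty suffix beta
    derives the empty string, i.e. the rule ends in a nullable symbol."""
    return [(lhs, tuple(alt)) for lhs, alts in grammar.items()
            for alt in alts if alt and alt[-1] in nullable]

nnt = nullable_nonterminals(GRAMMAR)
print(sorted(nnt))                               # -> ['A', 'B', 'S']
print(right_nullable_productions(GRAMMAR, nnt))  # S ::= B and B ::= A A
```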
It is well known that the number of parses of a sentence with an ambiguous
grammar may grow exponentially with the size of the sentence [7]. To avoid
exponential complexity, GLR-style algorithms build an efficient representation
of all possible parse trees, using subtree sharing and local ambiguity packing.
However, the SGLR and RNGLR algorithms construct parse trees in different
ways and use slightly different representations. RNGLR essentially follows the
approach described by Rekers – the creation and sharing of trees is handled
directly by the parsing algorithm – but does not construct the most compact
representation possible. The SGLR algorithm uses the ATerm library [23] to
³ α, β are possibly empty lists of terminals and non-terminals, ε is the empty string
and ⇒* represents a derivation in zero or more steps.
construct parse trees thereby taking advantage of the maximal sharing it imple-
ments. This approach has several consequences. The parsing algorithm can be
simplified significantly by replacing all parse tree creation and manipulation code
with calls to the ATerm library. Although the library takes care of all sharing,
the creation of ambiguities and cycles requires extra work (see Section 4.1).
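Maximal sharing in the style of the ATerm library can be sketched with a hash-consing table (illustrative only, not the actual ATerm API): every node is interned, so structurally equal subtrees are one object and equality reduces to a pointer comparison.

```python
# Hash-consing sketch: structurally equal terms become the same object.
_table = {}

def mk(label, *children):
    """Return the unique node for (label, children)."""
    key = (label,) + children
    return _table.setdefault(key, key)   # the tuple itself is the node

a = mk("Id", mk("a"))
b = mk("Id", mk("a"))
print(a is b)                            # -> True: built once, shared
plus = mk("Plus", a, b)                  # both arguments share a subtree
print(plus[1] is plus[2])                # -> True
```

Since nodes are immutable and created bottom-up, the resulting structure is a directed acyclic graph, which is why representing cycles needs the extra treatment described in Section 4.1.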
As previously mentioned, in addition to the different construction approaches,
a slightly different representation of parse forests is used. RNGLR labels interior
nodes using non-terminal symbols and uses packing nodes to represent ambigui-
ties [18]. SGLR labels interior nodes with productions and represents ambiguous
trees using ambiguity clusters labeled by non-terminal symbols. The reason that
production rules are used to label the interior nodes of the forest is to implement
some of the disambiguation filters that are discussed later in this section.
The SGLR algorithm. SGLR differs from RNGLR mainly in the filters that
are targeted at solving lexical ambiguity. Its filters for priority and preference
will be discussed as well. SGLR introduces the following four types of filters:
follow restrictions, rejects, preferences and priorities. Each filter type targets a
particular kind of ambiguity. Each filter is derived from a corresponding declar-
ative disambiguation construct in the SDF grammar formalism [5]. Formally,
each filter is a function that removes certain derivations from parse forests (sets
of derivations). Practically, filters are implemented as early in the parsing ar-
chitecture as possible, i.e. removing reductions from parse tables or terminating
parallel stacks in the GSS.
Four filter types. We now briefly define the semantics of the four filter types
for later reference. A follow restriction is intended to implement longest match
and first match behaviour of lexical syntax. In the following example, the -/-
operator defines a restriction on the non-terminal I. Its parse trees may not be
followed immediately by any character in the class [A-Za-z0-9], which effectively
results in longest match behaviour for I:

I ::= [A-Za-z][A-Za-z0-9]*        I -/- [A-Za-z0-9]        (3.1)
In general, given a follow restriction A -/- α where A is a non-terminal and α
is a character class, any parse tree whose root is A ::= γ will be filtered if its
yield in the input string is immediately followed by any character in α. Multiple
character follow restrictions, as in A -/- α1.α2...αn, generalize the concept. If
each of the n characters beyond the yield of A fits in its corresponding class
αi, the tree with root A is filtered. Note that the follow restriction incorporates
information from beyond the hierarchical context of the derivation for A, i.e. it
is not context-free.
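A minimal sketch of this semantics, assuming a simplified setting where candidate derivations of I are identified only by the end position of their yield (illustrative, not the actual SGLR machinery):

```python
import string

# Restricted follow class from (3.1): [A-Za-z0-9].
RESTRICTED = set(string.ascii_letters + string.digits)

def follow_filtered(inp, yield_end):
    """True if a tree ending at yield_end is filtered, i.e. the next
    character would continue the identifier (longest match)."""
    return yield_end < len(inp) and inp[yield_end] in RESTRICTED

inp = "count1 = 0"
# Candidate I-derivations starting at 0: "c", "co", ..., "count1".
survivors = [inp[:e] for e in range(1, 7)
             if not follow_filtered(inp, e)]
print(survivors)   # -> ['count1']  only the longest match survives
```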
The reject filter is intended to implement reservation, i.e. keyword reservation.
In the following example, the {reject} attribute defines that the keyword public
is to be reserved from I:

I ::= [A-Za-z][A-Za-z0-9]*        I ::= “public” {reject}        (3.2)

In general, given a production A ::= γ and a reject production A ::= δ {reject},
all trees whose roots are labeled A ::= δ {reject} are filtered and any tree whose
root is labeled A ::= γ is filtered if its yield is in the language generated by δ.
Reject filters give SGLR the ability to parse non-context-free languages.
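The reject semantics of (3.2) can be sketched as a filter on candidate derivations (illustrative encoding; SGLR applies rejects during parsing rather than as a separate pass):

```python
import re

IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")
REJECTED = {"public"}       # the language of I ::= "public" {reject}

def identifier_trees(word):
    """All surviving I-derivations for `word`: the ordinary identifier
    production applies unless the yield is in a rejected language."""
    if IDENT.fullmatch(word) and word not in REJECTED:
        return [("I ::= [A-Za-z][A-Za-z0-9]*", word)]
    return []

print(identifier_trees("publicity"))   # kept: an ordinary identifier
print(identifier_trees("public"))      # -> []  the keyword is reserved
```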
The preference filter is intended to select one derivation from several al-
ternative overlapping (ambiguous) derivations. The following example uses the
{prefer} attribute to define that in case of ambiguity the preferred tree should
be the only one that is not filtered. The dual of {prefer} is {avoid}:

I ::= [A-Za-z][A-Za-z0-9]*        I ::= “public” {prefer}        (3.3)

In general, given n productions A ::= γ1 to A ::= γn and a preferred production
A ::= δ {prefer}, any tree whose root is labeled by any of A ::= γ1 to A ::= γn
will be filtered if its yield is in the language generated by δ. All trees whose roots
are A ::= δ {prefer} remain. Dually, given an avoided production A ::= κ {avoid},
any tree whose root is A ::= κ {avoid} is filtered when its yield is in one of the
languages generated by γ1 to γn. In this case, all trees with roots A ::= γ1 to
A ::= γn remain. Consequently, the preference filter cannot be used to recognize
non-context-free languages.
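A sketch of the {prefer} semantics on a single ambiguity cluster (illustrative encoding, with each tree written as a (production, attribute, yield) triple):

```python
def apply_prefer(cluster):
    """If any tree in the cluster is rooted in a {prefer} production,
    only the preferred trees survive; otherwise nothing is filtered."""
    preferred = [t for t in cluster if t[1] == "prefer"]
    return preferred if preferred else cluster

cluster = [
    ("I ::= [A-Za-z][A-Za-z0-9]*", None,     "public"),
    ('I ::= "public"',             "prefer", "public"),
]
print(apply_prefer(cluster))   # only the {prefer} tree remains
```

A cluster without any preferred tree passes through unchanged, matching the definition above; {avoid} would be the mirror image, removing the avoided trees when a non-avoided alternative exists.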
The priority filter solves operator precedence and associativity. The following
example uses priority and associativity:

E ::= E “→” E {right}    >    E ::= E “or” E {left}        (3.4)

The > defines that no tree with the “→” production at its root will have a child
tree with the “or” production at its root. This effectively gives the “→” production
higher precedence. The {right} attribute defines that no tree with the “→” production
at its root may have a first child with the same production at its root. In general,
we index the > operator to identify for which argument a priority holds and map
all priority and associativity declarations to sets of indexed priorities. Given an
indexed priority declaration A ::= αBiβ >i Bi ::= δ, where Bi is the i-th symbol
in αBiβ, then any tree whose root is A ::= αBiβ with a subtree that has Bi ::= δ
as its root at index i, is filtered. The priority filter is not known to extend the
power of SGLR beyond recognizing context-free languages.
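The indexed-priority semantics can be sketched as a table of forbidden (parent production, argument index) pairs; the encoding and the "->" operator literal are illustrative assumptions, not the paper's implementation:

```python
FORBIDDEN = {
    # '>' : the "or" production may not appear under either argument of
    # the higher-priority production; {right}: the production may also
    # not be its own first (leftmost) child.
    ('E ::= E "->" E', 0): {'E ::= E "or" E', 'E ::= E "->" E'},
    ('E ::= E "->" E', 2): {'E ::= E "or" E'},
}

def violates(prod, children):
    """children: (production, yield) pairs, or None for terminals."""
    return any(c is not None and c[0] in FORBIDDEN.get((prod, i), set())
               for i, c in enumerate(children))

# "a or b -> c" parsed as (a or b) -> c is filtered by '>':
print(violates('E ::= E "->" E',
               [('E ::= E "or" E', "a or b"), None, ('E ::= a', "c")]))
# Right-nested "->" at argument index 2 is allowed ({right}):
print(violates('E ::= E "->" E',
               [('E ::= a', "a"), None, ('E ::= E "->" E', "b -> c")]))
```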
4 SRNGLR

We now discuss the amalgamated SRNGLR algorithm that combines the scan-
nerless behaviour of SGLR with the faster parsing behaviour of RNGLR.
Although the main difference between SRNGLR and SGLR is in the imple-
mentation of the filters at parse table generation time — all of SGLR’s filters
need to be applied to the static construction of SRNGLR’s ε-trees — there are
also some small changes in the parser run-time and post-parse filtering.
4.1 Construction of ε-Trees
The basic strategy is to first construct the complete ε-trees for each RN reduction
in a straightforward way, and then apply filters to them. We collect all the
productions for nullable non-terminals from the input grammar, and then for
each non-terminal we produce all of its derivations, for the empty string, in a
top-down recursive fashion. If there are alternative derivations, they are collected
under an ambiguity node.
We use maximally shared ATerms [4] to represent parse trees. ATerms are
directed acyclic graphs, which prohibits by definition the construction of cycles.
However, since parse trees are not general graphs we may use the following trick.
The second time a production is used while generating a nullable tree, a cycle
is detected and, instead of looping, we create a cycle node. This special node
stores the length of the cycle. From this representation a (visual) graph can be
trivially reconstructed.
Note that this representation of cycles need not be minimal, since a part of the
actual cycle may be unrolled and we detect cycles on twice visited productions,
not non-terminals. The reason for checking on productions is that the priority
filter is specific for productions, such that after filtering, cycles may still exist,
but only through the use of specific productions.
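The construction described above can be sketched as follows (illustrative node shapes and grammar, not the ATerm representation): derivations of the empty string are built top-down, alternatives are collected under an ambiguity node, and a production met a second time on the current path becomes a cycle node recording the cycle length.

```python
EPS_GRAMMAR = {                  # productions usable to derive epsilon
    "A": [("A ::= ", []), ("A ::= B", ["B"])],
    "B": [("B ::= A", ["A"])],
}

def eps_tree(nt, path=()):
    alts = []
    for prod, rhs in EPS_GRAMMAR[nt]:
        if prod in path:                         # production revisited
            alts.append(("cycle", len(path) - path.index(prod)))
        else:
            alts.append((prod, [eps_tree(s, path + (prod,)) for s in rhs]))
    return alts[0] if len(alts) == 1 else ("amb", alts)

tree = eps_tree("A")
print(tree)   # the inner derivation of A ends in a cycle node of length 2
```

Because the check is on twice-visited productions rather than non-terminals, a filtered tree can still contain a (possibly unrolled) cycle through the surviving productions, as noted above.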
4.2 Restrictions
We distinguish single character follow restrictions from multiple lookahead re-
strictions. The first are implemented completely statically, while the latter have
a partial implementation at parse table generation time and a partial implemen-
tation during parsing.
Parse table generation. An RN reduction A ::= α·β with nullable tree Tβ in
the parse table can be removed or limited to certain characters on the lookahead.
When one of the non-terminals B in Tβ has a follow restriction B -/- γ, Tβ may
have less ambiguity or be filtered completely when a character from γ is on the
lookahead for reducing A ::= α·β. Since there may be multiple non-terminals
in Tβ, there may be multiple follow restrictions to be considered.
The implementation of follow restrictions starts when adding the RN reduction
to the SLR(1) table. For each different kind of lookahead character (token),
the nullable tree for Tβ is filtered, yielding different instances of Tβ for different
lookaheads. While filtering we visit the nodes of Tβ in a bottom-up fashion. At
each node in the tree the given lookahead character is compared to the applicable
follow restrictions. These are computed by aggregation. When visiting a node
labelled C ::= D E, the restriction class for C is the union of the restriction classes
of D and E. This means that C is only acceptable when both follow restrictions
are satisfied. When visiting an ambiguity node with two children labelled F and
G, the follow restrictions for this node are the intersection of the restrictions of
F and G. This means that the ambiguity node is acceptable when either one of
the follow restrictions is satisfied.
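The aggregation can be sketched as follows. This is a hypothetical encoding (plain tuples, restriction classes kept per non-terminal), not the data structures of SGLR or SRNGLR:

```python
def filter_tree(node, lookahead, restrict):
    """Filter a nullable tree bottom-up against follow restrictions.
    `node` is ("amb", [alternatives]) or ("prod", nonterminal, [children]);
    `restrict` maps a non-terminal to its restricted character class.
    Returns (filtered node or None, aggregated restriction class)."""
    if node[0] == "amb":
        kept, classes = [], []
        for alt in node[1]:
            f, cls = filter_tree(alt, lookahead, restrict)
            if f is not None:
                kept.append(f)
                classes.append(cls)
        if not kept:
            return None, set()
        # ambiguity node: acceptable if ANY alternative is -> intersection
        return ("amb", kept), set.intersection(*classes)
    _, nonterminal, children = node
    kept, cls = [], set(restrict.get(nonterminal, ()))
    for child in children:
        f, c = filter_tree(child, lookahead, restrict)
        if f is None:
            return None, set()      # a required child was filtered away
        kept.append(f)
        cls |= c                    # production node: union of the classes
    if lookahead in cls:
        return None, set()          # lookahead violates a follow restriction
    return ("prod", nonterminal, kept), cls

restrict = {"B": {"x"}}                       # B -/- [x]
t = ("prod", "A", [("prod", "B", [])])        # A derives a nullable B
```

With this invented grammar fragment, the tree for A is filtered completely under lookahead "x" but survives under lookahead "y", matching the union/intersection rule stated above.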
If the lookahead character is in the restricted set, the current node is filtered;
if not, the current node remains. The computed follow restrictions for the current
node are then propagated up the tree. Note that this algorithm may lead to the
complete removal of Tβ, in which case the RN reduction for this lookahead is not
added. If Tβ is only partially filtered, and no follow restriction applies to the
non-terminal A of the RN reduction, the RN reduction is added to the table,
including the filtered ε-tree.
Parser run-time. Multiple-character follow restrictions cannot be filtered statically.
They are collected, and the corresponding RN reductions are added to the parse
table marked as conditional lookahead reductions. Both the testing of the
follow restriction and the filtering of the ε-tree must be done at parse time.
Before any lookahead RN reduction is applied by the parsing algorithm, the
ε-tree is filtered using the follow restrictions and the lookahead information from
the input string. If the filtering removes the tree completely, the reduction is not
performed. If it is not removed completely, the RN reduction is applied and a
tree node is constructed with the partially filtered ε-tree.
4.3 Priorities
Parse table generation. The priority filters require changes only to the parse
table generation phase; the parser run-time and post-parse filtering phases
remain the same as in SGLR. The priority filtering depends on the chosen
representation of the ε-trees (see also Section 3): each node holds a production
rule and cycles are unfolded once. Take for example S ::= S S {left} | ε. The filtered
ε-tree for this grammar should represent derivations where S ::= S S can be
nested on the left, but not on the right. The cyclic tree for S must be unfolded
once to make one level of nesting explicit. Then the right-most derivations can
be filtered. This representation allows a straightforward filtering of all trees
that violate priority constraints. Note that priorities may filter all of the ε-tree,
resulting in the removal of the corresponding RN reduction.
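Such a check over the once-unfolded tree can be sketched as follows; the tuple encoding, the production strings and the helper name are ours, not the SGLR representation:

```python
def violates_left_assoc(node, left_assoc):
    """node: (production, [children]). True if a {left} production occurs
    as the rightmost child of itself, i.e. it is nested on the right."""
    prod, children = node
    if children and prod in left_assoc and children[-1][0] == prod:
        return True
    return any(violates_left_assoc(c, left_assoc) for c in children)

SS = "S ::= S S"        # declared {left}
EPS = "S ::= "          # the ε-production
# one unfolded level of nesting, explicit on the left vs. on the right:
left_nested  = (SS, [(SS, [(EPS, []), (EPS, [])]), (EPS, [])])
right_nested = (SS, [(EPS, []), (SS, [(EPS, []), (EPS, [])])])
```

Filtering then keeps left_nested and removes right_nested, leaving only derivations where S ::= S S nests on the left.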
4.4 Preferences
Parse table generation. The preference filter strongly resembles the priority
filter. Preferences are simply applied to the ε-trees, resulting in smaller ε-trees.
However, preferences can never lead to the complete removal of an ε-tree.
Post-parse filter. RN reductions labeled with {prefer} or {avoid} are processed
in a post-parse filter in the same way as normal reductions are.
4.5 Rejects
The implementation of the reject filter was changed in both SGLR and SRNGLR
to improve the predictability of its behaviour.
Parse table generation. If any nullable production is labeled with {reject},
then the empty string is not acceptable for that production's non-terminal.
If such a production occurs in an ε-tree, we can statically filter according to the
definition of rejects in Section 3. If no nullable derivation is left after filtering,
we can also remove the entire RN reduction.
Parser run-time. Note that we have changed the original algorithm [25] for reject
filtering at parser run-time in both SGLR and SRNGLR. The completeness
and predictability of the filter have been improved. The simplest implementation
of reject is to filter redundant trees in a post-parse filter, directly following the
definition of its semantics given in Section 3. However, the goal of the implementation
is to prohibit further processing on GSS stacks that can be rejected
as early as possible. This can result in a large gain in efficiency, since it makes
the parsing process more deterministic, i.e. there exist on average fewer parallel
branches of the GSS during parsing.
The semantics of the reject filter is based on syntactic overlap, i.e. ambiguity
(Section 3). So, the filter needs to detect ambiguity between a rejected production
A ::= γ {reject} and a normal production A ::= δ. The goal is to stop further
processing of reductions of A. For this to work, the ambiguity must be detected
before further reductions on A are done. Such ordering of the scheduling of
reductions was proposed by Visser [25]. However, there are certain grammars
(especially those with nested or nullable rejects) for which the ordering does not
work and rejected trees do not get filtered correctly. Alternative implementations
of Visser's algorithm have worked around these issues at the cost of filtering too
many derivations.
We have implemented an efficient method that does not rely on the order in which
reductions are performed. The details of this reject implementation are:
- Edges created by a reduction of a rejected production are stored separately
in GSS nodes. We prevent other reductions from traversing the rejected edges,
thereby preventing possible further reductions on many stacks.
- In GLR, edges collect ambiguous derivations, and if an edge becomes rejected
because one of the alternatives is rejected, it stays rejected.
- Rejected derivations that escape are filtered in a post-parse tree walker. They
may escape when an alternative, non-rejected, reduction creates an edge and
this edge is traversed by a third reduction before the original edge becomes
rejected by a production marked with {reject}.
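The first two points above can be sketched with hypothetical GSS structures; Edge, GSSNode and their fields are our names, not those of the SGLR sources:

```python
class Edge:
    """A GSS edge collecting (possibly ambiguous) derivations."""
    def __init__(self):
        self.derivations = []
        self.rejected = False

    def add_derivation(self, tree, from_rejected_production):
        self.derivations.append(tree)
        # rejection is sticky: once one collected alternative comes from a
        # rejected production, the edge stays rejected
        self.rejected = self.rejected or from_rejected_production

class GSSNode:
    def __init__(self):
        self.edges = {}                 # target node -> Edge

    def traversable_edges(self):
        # reductions never traverse rejected edges, cutting off many stacks
        return {t: e for t, e in self.edges.items() if not e.rejected}

node = GSSNode()
normal, bad = Edge(), Edge()
normal.add_derivation("t1", from_rejected_production=False)
bad.add_derivation("t2", from_rejected_production=True)
bad.add_derivation("t3", from_rejected_production=False)   # still rejected
node.edges = {"q1": normal, "q2": bad}
```

Storing the rejected flag on the edge rather than ordering reductions is what makes the method independent of scheduling.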
Like the original, this algorithm filters many parallel stacks at run-time, with
the added benefit that it is more clearly correct. We argue that: (A) we do not
filter trees that should not be filtered, (B) we do not depend on the completeness
of the filtering during parse time, and (C) we do not try to order the scheduling of
reduce actions, which simplifies the implementation of SRNGLR significantly.
Post-parse filter. This follows the definition of the semantics described
in Section 3. To handle nested rejects correctly, the filter must be applied in a
bottom-up fashion.
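Such a bottom-up walk might look as follows, under an assumed tuple encoding of the parse forest (not the ATerm representation; the production strings are invented):

```python
def post_parse_reject(node, rejected):
    """node: ("amb", [alternatives]) or (production, [children]).
    Returns the filtered node, or None if the reject semantics removes it."""
    if node[0] == "amb":
        # children first: nested rejects are resolved before this level
        alts = [post_parse_reject(a, rejected) for a in node[1]]
        alts = [a for a in alts if a is not None]
        # syntactic overlap with a rejected production filters ALL
        # alternatives of this non-terminal
        if not alts or any(a[0] in rejected for a in alts):
            return None
        return alts[0] if len(alts) == 1 else ("amb", alts)
    prod, children = node
    out = []
    for c in children:
        f = post_parse_reject(c, rejected)
        if f is None:
            return None        # a nested reject removed this derivation
        out.append(f)
    return (prod, out)

rejected = {'Id ::= "let"'}    # e.g. keywords rejected from identifiers
amb = ("amb", [("Id ::= [a-z]+", []), ('Id ::= "let"', [])])
```

Here the ambiguity between the normal identifier production and the rejected keyword production removes the whole ambiguity node, while an unambiguous identifier derivation passes through untouched.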
5 Related Work
The cost of general parsing as opposed to deterministic parsing or parsing with
extended lookahead has been studied in many different ways. Our contribution
is a continuation of the RNGLR algorithm applied in a different context.
Despite the fact that general context-free parsing is a mature field in Computer
Science, its worst-case complexity is still unknown. The algorithm with
the best asymptotic time complexity to date is presented by Valiant [21]. However,
because of its high constant overheads this approach is unlikely to be used
in practice. There have been several attempts at speeding up the run time of LR
parsers that have focused on implementing the handle-finding automaton (DFA)
in low-level code (see [12]). A different approach to improving efficiency is
presented in [2], the basic ethos of which is to reduce the reliance on the stack.
Although this algorithm fails to terminate in certain cases, the RIGLR algorithm
presented in [13] has been proven correct for all CFGs.
Two other general parsing algorithms that have been used in practice are the
CYK [27] and Earley [8] algorithms. Both display cubic worst-case complexity,
although the CYK algorithm requires grammars to be transformed to Chomsky
Normal Form before parsing. The BRNGLR [19] algorithm achieves cubic worst-case
complexity without needing to transform the grammar.
Salomon and Cormack [16] first used scannerless parsing to describe deterministic
parsers of complete character-level grammars. Another deterministic
scannerless parsing technique, which uses Parsing Expression Grammars instead
of CFGs, is the Packrat [10] algorithm and its implementations [11]. It has been
shown to be useful in parsing extensible languages. Another tool that has been
used to generate scanners and parsers for extensible languages with embedded
DSLs is Copper [26]. It uses an approach called context-aware scanning, where
the scanner uses contextual information from the parser to be more discriminating
with the tokens it returns. This allows the parser to parse a larger class of
languages than traditional LR parsers that use separate scanners.
6 Conclusions
We improved the speed of parsing and filtering for scannerless grammars signif-
icantly by applying the ideas of RNGLR to SGLR. The disambiguation filters
that complement the parsing algorithm at all levels needed to be adapted and
extended. Together, the implementation of the filters and the RN tables make
scannerless GLR parsing considerably faster. The application areas of software
renovation and embedded language design benefit directly: the speedup allows
experimentation with more ambiguous grammars, e.g. interesting embeddings of
scripting languages, domain-specific languages and legacy languages.
Acknowledgements. We are grateful to Arnold Lankamp for helping to im-
plement the GSS garbage collection scheme for SRNGLR. The first author was
partially supported by EPSRC grant EP/F052669/1.
References
1. Aho, A.V., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques and Tools.
Addison-Wesley, Reading (1986)
2. Aycock, J., Horspool, R.N., Janousek, J., Melichar, B.: Even faster generalised LR
parsing. Acta Inform. 37(9), 633–651 (2001)
3. van den Brand, M.G.J., van Deursen, A., Heering, J., de Jong, H.A., de Jonge, M.,
Kuipers, T., Klint, P., Moonen, L., Olivier, P.A., Scheerder, J., Vinju, J.J., Visser,
E., Visser, J.: The ASF+SDF meta-environment: A component-based language
development environment. In: Wilhelm, R. (ed.) CC 2001. LNCS, vol. 2027, pp.
365–370. Springer, Heidelberg (2001)
4. van den Brand, M.G.J., de Jong, H.A., Klint, P., Olivier, P.A.: Efficient Annotated
Terms. Softw., Pract. Exper. 30(3), 259–291 (2000)
5. van den Brand, M.G.J., Scheerder, J., Vinju, J.J., Visser, E.: Disambiguation Fil-
ters for Scannerless Generalized LR Parsers. In: Horspool, R.N. (ed.) CC 2002.
LNCS, vol. 2304, pp. 143–158. Springer, Heidelberg (2002)
6. Bravenboer, M., Tanter, É., Visser, E.: Declarative, formal, and extensible syntax
definition for AspectJ. SIGPLAN Not. 41(10), 209–228 (2006)
7. Church, K., Patil, R.: Coping with syntactic ambiguity or how to put the block
in the box on the table. American Journal of Computational Linguistics 8(3–4),
139–149 (1982)
8. Earley, J.: An efficient context-free parsing algorithm. Comm. ACM 13(2), 94–102 (1970)
9. Economopoulos, G.R.: Generalised LR parsing algorithms. PhD thesis, Royal Hol-
loway, University of London (August 2006)
10. Ford, B.: Parsing expression grammars: a recognition-based syntactic foundation.
In: POPL 2004, pp. 111–122. ACM, New York (2004)
11. Grimm, R.: Better extensibility through modular syntax. In: PLDI 2006, pp. 38–51.
ACM, New York (2006)
12. Horspool, R.N., Whitney, M.: Even faster LR parsing. Softw., Pract. Exper. 20(6),
515–535 (1990)
13. Johnstone, A., Scott, E.: Automatic recursion engineering of reduction incorpo-
rated parsers. Sci. Comp. Programming 68(2), 95–110 (2007)
14. Nozohoor-Farshi, R.: GLR parsing for ε-grammars. In: Tomita, M. (ed.) Generalized
LR Parsing, ch. 5, pp. 61–75. Kluwer Academic Publishers, Netherlands
15. Rekers, J.: Parser Generation for Interactive Environments. PhD thesis, University
of Amsterdam (1992)
16. Salomon, D.J., Cormack, G.V.: Scannerless NSLR(1) parsing of programming lan-
guages. SIGPLAN Not. 24(7), 170–178 (1989)
17. Salomon, D.J., Cormack, G.V.: The disambiguation and scannerless parsing of
complete character-level grammars for programming languages. Technical Report
95/06, Dept. of Computer Science, University of Manitoba (1995)
18. Scott, E., Johnstone, A.: Right nulled GLR parsers. ACM Trans. Program. Lang.
Syst. 28(4), 577–618 (2006)
19. Scott, E., Johnstone, A., Economopoulos, R.: BRNGLR: a cubic Tomita-style GLR
parsing algorithm. Acta Inform. 44(6), 427–461 (2007)
20. Tomita, M.: Efficient Parsing for Natural Languages. A Fast Algorithm for Prac-
tical Systems. Kluwer Academic Publishers, Dordrecht (1985)
21. Valiant, L.G.: General context-free recognition in less than cubic time. J. Comput.
System Sci. 10, 308–315 (1975)
22. van den Brand, M.G.J.: Pregmatic, a generator for incremental programming en-
vironments. PhD thesis, Katholieke Universiteit Nijmegen (1992)
23. van den Brand, M.G.J., de Jong, H.A., Klint, P., Olivier, P.A.: Efficient annotated
terms. Softw., Pract. Exper. 30(3), 259–291 (2000)
24. van Rossum, G.: Python reference manual,
25. Visser, E.: Syntax Definition for Language Prototyping. PhD thesis, University of
Amsterdam (1997)
26. Van Wyk, E.R., Schwerdfeger, A.C.: Context-aware scanning for parsing
extensible languages. In: GPCE 2007, pp. 63–72. ACM Press, New York (2007)
27. Younger, D.H.: Recognition and parsing of context-free languages in time n³.
Inform. and Control 10(2), 189–208 (1967)