Language Translation Using PCCTS and C++ Initial Release 1
Language Translation Using
PCCTS and C++
(A Reference Guide)
[Initial Release to Internet for Review and General Bashing]
Terence John Parr
Parr Research Corporation
June 26, 1995
Preface 7
CHAPTER 1 Introduction 11
1.1 ANTLR 12
1.2 SORCERER 13
1.2.1 Intermediate Representations and Translation 14
1.2.2 SORCERER Versus Hand-Coded Tree Walking 16
1.2.3 What Is SORCERER Good At and Bad At? 19
1.2.4 How Is SORCERER Different Than a Code-Generator Generator? 19
CHAPTER 2 ANTLR Reference 21
2.1 ANTLR Descriptions 21
2.1.1 Comments 23
2.1.2 #header Directive 23
2.1.3 #parser Directive 23
2.1.4 Parser Classes 24
2.1.5 Rules 25
2.1.6 Subrules (EBNF Descriptions) 26
2.1.7 Rule Elements 26
Actions 26
Semantic Predicate 26
Syntactic Predicate 27
Tokens, Token Classes and Token Operators 27
Rule References 27
Labels 27
AST Operators 28
Exception Operator 28
2.1.8 Multiple ANTLR Description Files 28
2.2 Lexical Directives 29
2.2.1 Token Definitions 29
2.2.2 Regular Expressions 31
2.2.3 Token Order and Lexical Ambiguities 31
2.2.4 Token Definition Files (#tokdefs) 32
2.2.5 Token Classes 33
2.2.6 Lexical Classes 35
Multiple grammars, multiple lexical analyzers 36
Single grammar, multiple lexical analyzers 36
2.2.7 Lexical Actions 36
2.2.8 Error Classes 37
2.3 Actions 37
2.3.1 Placement 37
2.3.2 Time of Execution 38
2.3.3 Interpretation of Action Text 38
2.3.4 Init-Actions 40
2.3.5 Fail Actions 40
2.3.6 Accessing Token Objects From Grammar Actions 41
2.4 C++ Interface 41
2.4.1 The Utility of C++ Classes in Parsing 42
2.4.2 Invoking ANTLR Parsers 43
2.4.3 ANTLR C++ Class Hierarchy 44
Token Classes 44
Scanners and Token Streams 46
Token Buffer 46
Parsers 47
AST Classes 48
2.5 Intermediate-Form Tree Construction 51
2.5.1 Required AST Definitions 51
2.5.2 AST Support Functions 52
2.5.3 Operators 53
2.5.4 Interpretation of C/C++ Actions Related to ASTs 55
2.6 Predicates 56
2.6.1 Semantic Predicates 56
Validating Semantic Predicates 57
Disambiguating Semantic Predicates 57
Semantic Predicates Effect Upon Syntactic Predicates 61
2.6.2 Syntactic Predicates 61
Syntactic Predicate Form and Meaning 62
Modified Parsing Strategy 63
Nested Syntactic Predicate Invocation 64
Efficiency 64
Syntactic Predicates Effect Upon Actions and Semantic Predicates 64
Syntactic Predicates Effect Upon Grammar Analysis 65
2.7 Parser Exception Handlers 65
2.7.1 Exception Handler Syntax 66
2.7.2 Exception Handler Order of Execution 67
2.7.3 Modifications to Code Generation 69
2.7.4 Semantic Predicates and NoSemViableAlt 69
2.7.5 Resynchronizing the Parser 70
2.7.6 The @ Operator 71
2.8 ANTLR Command Line Arguments 71
2.9 DLG Command Line Arguments 73
2.10 C Interface 74
2.10.1 Invocation of C Interface Parsers 74
2.10.2 Functions and Symbols in Lexical Actions 76
2.10.3 Attributes Using the C Interface 77
Attribute Definition and Creation 77
Attribute References 78
Attribute destruction 79
Standard Attribute Definitions 80
2.10.4 Interpretation of Symbols in C Actions 81
2.10.5 AST Definitions 81
CHAPTER 3 SORCERER Reference 83
3.1 Introductory Examples 83
3.2 C++ Programming Interface 85
3.2.1 C++ Class Hierarchy 87
3.2.2 C++ Files 88
3.3 C Programming Interface 88
3.3.1 C Types 90
3.3.2 C Files 90
3.4 Token Type Definitions 91
3.5 Using ANTLR and SORCERER Together 91
3.5.1 Using the C++ Interface 91
3.5.2 Using the C Interface 92
3.6 SORCERER Grammar Syntax 93
3.6.1 Rule Definitions: Arguments and Return Values 94
3.6.2 Special Actions 95
3.6.3 Special Node References 95
3.6.4 Tree Patterns 96
3.6.5 EBNF Constructs in the Tree-Matching Environment 97
3.6.6 Element Labels 98
3.7 @-Variables 99
3.8 Embedding Actions For Translation 102
3.9 Embedding Actions For Tree Transformations 103
3.9.1 Deletion 104
3.9.2 Modification 104
3.9.3 Augmentation 105
3.10 C++ Support Classes and Functions 107
3.11 C Support Libraries 108
3.11.1 Tree Library 108
3.11.2 List Library 110
3.11.3 Stack Library 111
3.11.4 Integer Stack Library 111
3.12 Error Detection and Reporting 112
3.13 Command Line Arguments 112
APPENDIX Notes for New Users of PCCTS 115
A Completely Serious, No-Nonsense, Startlingly-Accurate Autobiography 163
Preface
I like tools--always have. This is primarily because I’m fundamentally lazy and would much
rather work on something that made others productive rather than actually having to do any-
thing useful myself. For example, as a child, I was coerced into cutting lawns by my parents. I
spent hours trying to discover a mechanism by which the lawn mower could cut the lawn auto-
matically rather than simply taking the few minutes to fire up the mower and walk around the
lawn. This philosophy has followed me into adult life and eventually led to my guiding princi-
ple:
“Why program by hand in five days what you can spend five years of your life auto-
mating?”
This is pretty much what has happened to me with regard to language recognition and translation.
At the end of my undergraduate studies at Purdue (roughly ’86-’87), I was working for a
robotics company for whom I was developing an interpreter/compiler for a language called
KAREL. This was fun the first time (I inadvertently erased the whole thing). The second time,
I kept thinking “I don’t like or understand YACC. Isn’t there a way to automate what I’d build
by hand?” That thought kept rolling around in the back of my head even after I started EE
graduate school to pursue neural net research (my topic was going to be “Can I replace my lazy
brain with a neural net possessing the intelligence of a sea slug without anybody noticing”). As
an aside, I decided to take a course on language tool building taught by Hank Dietz.
The parser generator ANTLR eventually arose from the ashes of my course project with Hank
and I dropped neural nets in favor of ANTLR as my thesis project. This initial version of
ANTLR was pretty slick because it was a shorthand for what I’d build by hand, but at that time
ANTLR could only generate LL(1) parsers. Unfortunately, there were many such tools and
unless a tool was developed with parsing strength equal to or superior to YACC’s, nothing
would displace YACC as the de facto standard for parser generation; although, I was mainly
concerned with making myself more efficient at the time and had no World-domination aspira-
tions.
ANTLR currently has three things that make it generate strong parsers: (i) k>1 lookahead, (ii)
semantic predicates (the ability to have semantic information direct the parse), and (iii)
syntactic predicates (selective backtracking). Using more than a single symbol of lookahead has
always been desirable, but exponentially complex in time and space; therefore, I decided to
change the definition of k>1 lookahead and voila: LL(k) became possible (that’s how I escaped
Purdue with my Ph.D. before anyone got wise). Semantic and syntactic predicates are not my
ideas, but, together with Russell Quong at Purdue, I substantially augmented them to make
them truly useful. These capabilities can be shown in theory and practice to make ANTLR
parsers stronger than YACC’s pure LALR(1) parsers (our tricks could easily be added to an
LALR(1) parser generator, however). ANTLR also happens to be a flexible and easy-to-use
tool. As a result, ANTLR has become popular.
Near the end of my Ph.D., I started helping out some folks who wanted to build a FORTRAN
translator at the Army High Performance Computer Research Center at the University of
Minnesota. I used ANTLR to recognize their FORTRAN subset and built trees that I later traversed
with a number of (extremely similar) tree-walking routines. After building one too many of
these tree walkers, I thought “OK, I’m bored. Why can’t I build a tool that builds tree walk-
ers?” Such a tool would parse trees instead of text, but would be basically the same as ANTLR.
SORCERER was born. Building language translators using the ANTLR/SORCERER combi-
nation became much easier.
The one weak part of these tools has always been their documentation. This book is an attempt
to rectify this appalling situation and replaces the series of disjoint release notes for ANTLR,
DLG (our scanner generator), and SORCERER--the tools of the Purdue Compiler Construc-
tion Tool Set, PCCTS. I’ve also included Tom Moog’s wonderful notes for the newbie as an
appendix.
Giving credit to everyone who has significantly aided this project would be impossible, but
here is a good guess: Will Cohen and Hank Dietz were coauthors of the original PCCTS as a
whole. Russell Quong has been my partner in research for many years and is a coauthor of
ANTLR. Gary Funck and Aaron Sawdey are coauthors of SORCERER. Ariel Tamches spent a
week of his Christmas vacation in the wilds of Minnesota helping with the C++ output.
Sumana Srinivasan, Mike Monegan, and Steve Naroff of NeXT, Inc. provided extensive help in
the definition of the ANTLR C++ output; Sumana also developed the C++ grammar from
which was derived the C++ grammar available for PCCTS. Thom Wood and Randy Helzerman
both influenced the C++ output. Steve Robenalt pushed through the
comp.compilers.tools.pccts newsgroup and wrote the initial FAQ. Peter Dahl, then a Ph.D. candidate,
and Professor Matt O’Keefe (both at the University of Minnesota) tested versions of ANTLR
extensively. Dana Hoggatt (Micro Data Base Systems, Inc.) tested 1.00 heavily. Anthony
Green at Visible Decisions, John Hall at Worcester Polytechnic Institute, Devin Hooker at
Ellery Systems, Kenneth D. Weinert at Information Handling Services, Steve Hite, and Roy
Levow at Florida Atlantic University have been faithful beta testers of PCCTS. Scott Haney at
Lawrence Livermore Labs developed the Macintosh MPW port. I thank the planning group for
the first annual PCCTS workshop sponsored by Parr Research Corporation and held at NeXT
July 25 and 26, 1994: Gary Funck, Steve Robenalt, and Ivan Kissiov. John D. Mitchell is an all
around nice guy. Finally, a multitude of PCCTS users have helped refine ANTLR with their
suggestions; I apologize for not being able to mention everyone here who has supported the
PCCTS project.
Bug reports and general words of encouragement are welcome. Please send mail to
parrt@parr-research.com
You may also wish to visit our newsgroup
comp.compilers.tools.pccts
All of the tools in PCCTS are public domain. As such, there is no guarantee that the software is
useful, will do what you want, or will behave as you expect. There are most certainly bugs
still lurking in the code and there are probably errors in this book. I apologize for any
inconvenience in advance.
[This preface will be updated for the printed version of the book to include those who have
helped review this book etc... I will also search my mail logs for users with frequent bug reports
and add them as well].
Terence John Parr
parrt@parr-research.com
http://www.parr-research.com/~parrt
Minneapolis, Minnesota
June 1994
CHAPTER 1 Introduction
Computer language translation has become a common task. While compilers and tools for tra-
ditional computer languages (such as C, C++, FORTRAN, SMALLTALK or Java) are still
being built, their number is dwarfed by the thousands of mini-languages for which recognizers
and translators are being developed. Programmers construct translators for database formats,
graphical data files (e.g., SGI Inventor, AutoCAD), text processing files (e.g., HTML, SGML),
and application command-interpreters (e.g., SQL, EMACS); even physicists must write code
to read in the initial conditions for their finite-element computations.
Many programmers build recognizers (i.e., parsers) and translators by hand. They write a
recursive-descent parser that recognizes the input and either generates output directly, if the
translation is simple enough to allow this, or builds an intermediate representation of the input
for later translation, if the translation is complicated. Generally, some form of tree data-struc-
ture is used as an intermediate representation in this case (e.g., the input “3+4” can be conve-
niently represented by a tree with “+” at the root and “3” and “4” as leaves). In order to
manipulate or generate output from a tree, the programmer is again confronted with a recogni-
tion problem--that of matching tree templates against an input tree. As an input tree is tra-
versed, each subtree must be recognized in order to determine which translation action to
execute.
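
The hand-built approach described above can be sketched in a few lines of C++. This is only an illustration, not PCCTS output: the names Node, SumParser, and toPostfix are invented for this example, the input language is limited to sums of integers, and there is no error handling or memory cleanup. The parser builds a tree for input such as “3+4”, and the walker then recognizes each “+” subtree in order to decide which translation action (here, emitting postfix text) to execute.

```cpp
#include <cctype>
#include <string>

// Hypothetical tree node for this sketch; not a PCCTS class.
struct Node {
    std::string text;   // "+" or the digits of an integer
    Node *down;         // first child
    Node *right;        // next sibling
    explicit Node(const std::string &t) : text(t), down(0), right(0) {}
};

// Hand-coded recursive-descent recognizer for:  expr : INT ( '+' INT )* ;
// Nodes are never freed; a real translator would manage their lifetime.
class SumParser {
public:
    explicit SumParser(const std::string &in) : input(in), pos(0) {}

    Node *expr() {
        Node *left = integer();
        while (pos < input.size() && input[pos] == '+') {
            ++pos;                          // consume '+'
            Node *plus = new Node("+");
            plus->down = left;              // left operand is the first child
            left->right = integer();        // right operand is its sibling
            left = plus;                    // the sum becomes the new left operand
        }
        return left;
    }

private:
    Node *integer() {
        std::string digits;
        while (pos < input.size() &&
               std::isdigit(static_cast<unsigned char>(input[pos])))
            digits += input[pos++];
        return new Node(digits);
    }

    std::string input;
    std::string::size_type pos;
};

// Tree walker: recognize the subtree rooted at t, then run the matching action.
std::string toPostfix(const Node *t) {
    if (t->text == "+") {
        const Node *lhs = t->down;
        const Node *rhs = lhs->right;
        return toPostfix(lhs) + " " + toPostfix(rhs) + " +";
    }
    return t->text;                         // leaf: an integer
}
```

Calling toPostfix(SumParser("3+4").expr()) yields the text 3 4 +. Even in this tiny case the walker repeats, by hand, the structural knowledge that the parser already had.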
Many language tools exist to aid in translator construction and can be broadly divided into
either the parser generator category or the translator generator category. A parser generator is a
program that accepts a grammatical language description and generates a parser that recog-
nizes sentences in that language. A translator generator is a tool that accepts a grammatical lan-
guage description along with some form of translation specification and generates a program to
recognize and translate sentences in that language.
This book is a reference guide for the parser generator ANTLR, ANother Tool for Language
Recognition, and the tree-parser generator SORCERER, which is suited to source-to-source
translation. SORCERER does not fit into the translator generator category perfectly because it
is used to translate trees, whereas the typical translator generator can only be used to translate
text directly, thus hiding any intermediate steps or data structures from the programmer. The
ANTLR and SORCERER team more closely supports what programmers would build by
hand. Specifically, ANTLR is used to recognize textual input and to generate an intermediate
form tree, which can be manipulated by a SORCERER-generated tree-walker; however, both
tools can be used independently.
While every tool has its strengths and weaknesses, any evaluation must boil down to this: Pro-
grammers want to use tools that employ mechanisms they understand, that are sufficiently
powerful to solve their problem, that are flexible, that automate tedious tasks, and that generate
output that is easily folded into their application. Most language tools fail one or many of these
criteria. Consequently, parsers and translators are still often written by hand. ANTLR and
SORCERER have become popular because they were written specifically with these issues in
mind; i.e., they were written by programmers for programmers.
In this chapter, we describe the general features and a bit of motivation for the behavior and
style of ANTLR and SORCERER.
1.1 ANTLR
ANTLR constructs human-readable recursive-descent parsers in C or C++ from pred-LL(k)
grammars, namely LL(k) grammars, for k≥1, that support predicates.
Predicates allow arbitrary semantic and syntactic context to direct the parse in a systematic
way. As a result, ANTLR can generate parsers for many context-sensitive languages and many
non-LL(k)/LR(k) context-free languages. Semantic predicates indicate the semantic validity of
applying a production; syntactic predicates are grammar fragments that describe a syntactic
context that must be satisfied before recognizing an associated production. In practice, many
ANTLR users report that developing a pred-LL(k) grammar is easier than developing the cor-
responding LR(1) grammar.
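
To make the idea of a semantic predicate concrete, here is a minimal hand-written sketch; it is not ANTLR-generated code. It assumes a pre-tokenized statement and a set of known type names standing in for a symbol table; StatementKind, classify, and typeNames are names invented for this example. In C++, the statement T(x); declares x if T names a type and calls a function otherwise, so the token sequence alone cannot decide which production applies.

```cpp
#include <set>
#include <string>
#include <vector>

enum StatementKind { Declaration, FunctionCall };

// Decide how to parse a statement of the form  ID ( ID ) ;
// 'tokens' holds the statement already split into tokens, and
// 'typeNames' plays the role of a symbol table (both are assumptions
// of this sketch).
StatementKind classify(const std::vector<std::string> &tokens,
                       const std::set<std::string> &typeNames) {
    // The "semantic predicate": the choice of alternative depends on
    // symbol-table information about the first token, not on syntax.
    bool firstIsType = !tokens.empty() && typeNames.count(tokens[0]) > 0;
    return firstIsType ? Declaration : FunctionCall;
}
```

The same token sequence is classified differently depending on the contents of typeNames, which is exactly the situation a purely syntactic LL(k) or LR(k) decision cannot handle.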
In addition to a strong parsing strategy, ANTLR has many features that make it more program-
mer-friendly than the majority of LR/LALR and LL parser generators.
• ANTLR integrates the specification of lexical and syntactic analysis. A separate lexical
specification is unnecessary as lexical regular expressions (token descriptions) can be
placed in double-quotes and used as normal token references in an ANTLR grammar.
• ANTLR accepts grammar constructs in Extended Backus-Naur Form (EBNF) notation.
• ANTLR provides facilities for automatic abstract syntax tree construction.
• ANTLR generates recursive-descent parsers in C/C++ so that there is a clear correspon-
dence between the grammar specification and the ANTLR output. Consequently, it is
relatively easy for non-parsing experts to design and debug an ANTLR grammar.
• ANTLR has both automatic and manual facilities for error recovery and reporting. The
automatic mechanism is simple and effective for many parsing situations; the manual
mechanism called “parser exception handling” simplifies development of high-quality
error handling.
• ANTLR allows each grammar rule to have parameters and return values, facilitating
attribute passing during the parse. Because ANTLR converts each rule to a C/C++ func-
tion in a recursive descent parser, a rule parameter is simply a function parameter. Addi-
tionally, ANTLR rules can have multiple return values.
• ANTLR has numerous other features that make it a product rather than a research
project. ANTLR itself is written in highly portable C; its output can be debugged with
existing source-level debuggers and is easily integrated into programmers’ applications.
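
The correspondence between rules and functions mentioned in the list above can be illustrated by hand. The following sketch is not ANTLR output; NumberParser, list, and number are hypothetical names, and the "rules" are ordinary member functions written manually. It shows a rule parameter (radix) passed down as a function parameter and a rule result passed back up as a function return value.

```cpp
#include <cctype>
#include <string>

class NumberParser {
public:
    explicit NumberParser(const std::string &in) : input(in), pos(0) {}

    // Plays the role of a rule "list" that takes a radix and returns the
    // sum of a comma-separated list of numbers.
    long list(int radix) {
        long total = number(radix);         // pass the parameter down
        while (pos < input.size() && input[pos] == ',') {
            ++pos;                          // consume ','
            total += number(radix);
        }
        return total;                       // return value flows back up
    }

    // Plays the role of a rule "number" that takes a radix and returns
    // the value of the digits it consumed.
    long number(int radix) {
        long value = 0;
        while (pos < input.size() &&
               std::isalnum(static_cast<unsigned char>(input[pos]))) {
            unsigned char c = static_cast<unsigned char>(input[pos]);
            int digit = std::isdigit(c) ? c - '0'
                                        : std::tolower(c) - 'a' + 10;
            if (digit >= radix) break;      // not a digit in this radix
            value = value * radix + digit;
            ++pos;
        }
        return value;
    }

private:
    std::string input;
    std::string::size_type pos;
};
```

For example, NumberParser("ff,10").list(16) evaluates both numbers in base 16. Because each rule is just a function, the parameter and result can be inspected with an ordinary source-level debugger.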
Ultimately, the true test of a language tool’s usefulness lies with the vast industrial programmer
community. ANTLR is widely used in the commercial and academic communities. More than
1000 registered users in 37 countries have acquired the software since the original 1.00 release
in 1992. Several universities currently teach courses with ANTLR. Many commercial pro-
grammers use ANTLR.
For example, a major corporation, NeXT, has completed and is testing a unified C/Objective-
C/C++ compiler using an ANTLR grammar that was derived directly from the June 1993
ANSI X3J16 C++ grammar. [Measurements show that this ANTLR parser is about 20%
slower, in terms of pure parsing speed, than a hand-built recursive-descent parser that parses
only C/Objective-C, but not C++. The C++ grammar available for ANTLR was developed
using the NeXT grammar as a guide.] C++ has traditionally been difficult for other LL(1) tools
and LR(1)-based tools such as YACC. YACC grammars for C++ are extremely fragile with
regard to action placement; i.e., the insertion of an action can introduce conflicts into the C++
grammar. In contrast, ANTLR grammars are insensitive to action placement due to their LL(k)
nature.
The reference guide for ANTLR begins on page 21.
1.2 SORCERER
Despite the sophistication of code-generator generators and source-to-source translator genera-
tors (such as attribute grammar based tools), programmers often choose to build tree parsers by
hand for source translation problems. In many cases, a programmer has a front-end that con-
structs intermediate form trees and simply wants to traverse the trees and execute a few actions.
In this case, the optimal tree walks of code-generator generators and the powerful attribute
evaluation schemes of source-to-source translator systems are overkill; programmers would
rather avoid the overhead and complexity.
A SORCERER description is essentially an unambiguous grammar (collection of rules) in
Extended BNF notation, that describes the structure and content of a user's trees. The pro-
grammer annotates the tree grammar with actions to effect a translation, manipulate a user-
defined data structure, or manipulate the tree itself. SORCERER generates a collection of sim-
ple C or C++ functions, one for each tree-grammar rule, that recognizes tree patterns and per-
forms the programmer's actions in the specified sequence.
Tree pattern matching is done efficiently in a top-down manner via an LL(1)-based¹ parsing
strategy augmented with syntactic predicates to resolve non-LL(1) constructs (with selective
backtracking) and semantic predicates to specify any context-sensitive tree patterns. Tree
traversal speed is linear in the size of the tree unless a non-LL(1) construct is specified--in which
case backtracking can be employed selectively to recognize the construct while maintaining
near-linear traversal speed.
1. We build top-down parsers with one symbol of lookahead because they are usually
sufficient to recognize intermediate form trees as they are often specifically designed to
make translation easy; moreover, recursive-descent parsers provide tremendous semantic
flexibility.
SORCERER can be considered an extension to an existing language rather than a total
replacement as other tools aspire to be. Consequently, programmers can use SORCERER to handle
the well-understood, tedious problem of parsing trees, while not limiting themselves to
describing the intended translation problem purely as attribute manipulations. SORCERER
does not force the user to use any particular parser generator or intermediate representation. Its
application interface is extremely simple and can be linked with almost any application that
constructs and manipulates trees.
SORCERER was designed to work with as many tree structures as possible because it neither
requires nor assumes any pre-existing application such as a parser generator. However, we have made it
particularly easy to integrate with trees built by ANTLR-generated parsers. Using the SOR-
CERER C interface, the programmer’s trees must have fields down, right, and token
(which can be redefined easily with the C preprocessor). The SORCERER C++ interface is
much less restrictive--the programmer must only define a small set of functions to allow the
tree-parser to walk the programmer’s trees (this set includes down(), right(), and
type()).
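
As a rough illustration of how small that set of functions is, the sketch below defines a node class offering down(), right(), and type(), plus a generic routine that walks any tree through those three functions only. It is an assumption-laden sketch: MyNode, addChild, countType, and the token constants are invented here, and the class does not derive from the base classes that the actual SORCERER C++ interface expects (see Chapter 3 for those).

```cpp
// Hypothetical token types for this sketch.
enum TokenType { T_PLUS = 1, T_INT };

// A user-defined tree node exposing the three navigation functions
// named in the text.
class MyNode {
public:
    explicit MyNode(int tokenType) : type_(tokenType), down_(0), right_(0) {}

    MyNode *down() const { return down_; }    // first child
    MyNode *right() const { return right_; }  // next sibling
    int type() const { return type_; }        // token type of this node

    // Append a child at the end of this node's child list.
    void addChild(MyNode *child) {
        if (!down_) { down_ = child; return; }
        MyNode *last = down_;
        while (last->right_) last = last->right_;
        last->right_ = child;
    }

private:
    int type_;
    MyNode *down_;
    MyNode *right_;
};

// Count the nodes of a given token type in t, its siblings, and all of
// their descendants, using only down(), right(), and type().
int countType(const MyNode *t, int wanted) {
    int n = 0;
    for (; t; t = t->right()) {
        if (t->type() == wanted) ++n;
        n += countType(t->down(), wanted);
    }
    return n;
}
```

Any tree representation that can answer these three questions can be traversed the same way, which is why the interface places so few demands on the programmer's own node type.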
SORCERER operates in one of two modes: non-transform mode and transform mode. In non-
transform mode (the default case), SORCERER generates a simple tree parser that is best
suited to syntax-directed translation (the tree is not rewritten--a set of actions generates some
form of output). In transform mode, SORCERER generates a parser that assumes a tree trans-
formation will be done. Without programmer-intervention, the parser automatically copies the
input tree to an output tree. Each rule has an implicit (automatically defined) result tree; the
result tree of the start symbol is a pointer to the transformed tree. The various alternatives and
grammar elements may be annotated with “!” to indicate that they should not be automatically
linked into the output tree. Portions of or entire subtrees may be rewritten. A set of library
functions is available to support tree manipulations. Transform mode is specified with the
SORCERER -transform command-line option.
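
The default behavior of transform mode can be pictured as a tree copy in which some elements are left out of the result. The sketch below is written by hand and is not SORCERER's generated code or its support library; TNode, copyTree, and the token names are hypothetical. Omitting a node by token type stands in for annotating a grammar element with “!”.

```cpp
// Hypothetical token types for this sketch.
enum { T_BLOCK = 1, T_STAT, T_SEMI };

struct TNode {
    int type;       // token type
    TNode *down;    // first child
    TNode *right;   // next sibling
};

// Copy t, its siblings, and their subtrees into a new tree, leaving out
// every node (and its subtree) whose type equals omitType. The caller
// owns the nodes of the returned tree.
TNode *copyTree(const TNode *t, int omitType) {
    TNode *head = 0;
    TNode *tail = 0;
    for (; t; t = t->right) {
        if (t->type == omitType) continue;  // not linked into the output tree
        TNode *n = new TNode;
        n->type = t->type;
        n->down = copyTree(t->down, omitType);
        n->right = 0;
        if (tail) tail->right = n; else head = n;
        tail = n;
    }
    return head;
}
```

Copying ( BLOCK STAT SEMI STAT ) while omitting T_SEMI produces a new tree ( BLOCK STAT STAT ) and leaves the input tree untouched.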
We are often confronted with questions regarding the applicability of SORCERER. Some peo-
ple ask why intermediate representations are used for translation. Those who are already famil-
iar with the use of trees for translation ask why they should use SORCERER instead of
building a tree walker by hand or building a C++ class hierarchy with walk() or action()
virtual member functions. Compiler writers ask how SORCERER differs from code-generator
generators and ask what SORCERER is good at. In the sections that follow, we address these
issues to support our design choices regarding source translation and SORCERER.
1.2.1 Intermediate Representations and Translation
The construction of computer language translators and compilers is generally broken down
into separate phases such as lexical analysis, syntactic analysis, and translation where the task
of translation can be handled in one of two ways:
1. Actions can be executed during the parse of the input text stream to generate output;
when the parser has finished recognizing the input, the translation is complete. This
type of translation is often called syntax-directed translation.
2. Actions can be executed during the parse of the input text stream to construct an
intermediate representation (IR) of the input, which will be re-examined later to perform
a translation. These actions can be automatically inserted into the text parser by
ANTLR (see Section 2.5, Intermediate-Form Tree Construction).
The advantages of constructing an intermediate representation are that multiple translators can
be built without modifying the text parser, multiple simple passes over an IR can be used rather
than a single complex pass, and, because large portions of an IR can be examined quickly (i.e.,
without rewinding an input file), more complicated translations can be performed. Syntax-
directed translations are typically sufficient only for output languages that closely resemble the
input language or for languages that can be generated without having to examine large
amounts of the input stream, that is, with only local information.
For source-to-source translation, trees (normally called abstract syntax trees or ASTs) are the
most appropriate implementation of an IR because they encode the grammatical structure used
to recognize the input stream. For example, input string “3+4” is a valid phrase in an
expression language, but does not specify what language structure it satisfies. On the other hand, the
tree structure

      +
     / \
    3   4

has the same three input symbols, but additionally encodes the fact that the “+” is an operator
and that “3” and “4” are operands. There is no point to parsing the input if your AST does not
encode at least some of the language structure used to parse the input. The structure you
encode in the AST should be specifically designed to make tree walking easy during a
subsequent translation phase.
An AST should be distinguished from a parse tree, which encodes not only the grammatical
structure of the input, but also records which rules were applied during the parse. A parse tree for
our plus-tree might look like:

           expr
         /  |   \
    factor  +  factor
      |           |
      3           4

which is bulkier, contains information that is unnecessary for translation (namely the rule
names), and is harder to manipulate than a tree with operators as subtree roots.
If all tree structures were binary (each node had only two children), then a tree node could be
described with whatever information was needed for each node plus two child pointers.
Unfortunately, AST nodes often need more than two children and, worse still, the number of children
varies greatly; e.g., the most appropriate tree structure for an if-statement may be an If root
node with three children: one for the conditional, one for the then statement list, and one for
the else clause statement list. In both ANTLR and SORCERER we have adopted the
child-sibling tree implementation structure where each node has a first-child and next-sibling pointer.
The expression tree above would be structured and illustrated as

    +
    |
    3 --- 4

and an if-statement would be structured as

    If
    |
    expr --- slist --- slist

Child-sibling trees can be conveniently described textually in a LISP-like manner:
( parent child_1 ... child_n )
So, our expression tree could be described as
( + 3 4 )
and our If tree as
( If expr slist slist )
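
A small sketch may help connect the child-sibling structure to this textual notation. The code below is illustrative only; AstNode, makeTree, and toLisp are invented names rather than part of the ANTLR or SORCERER libraries. Each node carries a first-child pointer (down) and a next-sibling pointer (right), and toLisp prints a tree in the LISP-like form just shown.

```cpp
#include <string>

// Hypothetical child-sibling node for this sketch.
struct AstNode {
    std::string label;
    AstNode *down;    // first child
    AstNode *right;   // next sibling
    explicit AstNode(const std::string &l) : label(l), down(0), right(0) {}
};

// Link a parent to up to three children: the parent points at the first
// child only, and each child points at the next sibling.
AstNode *makeTree(AstNode *parent, AstNode *c1, AstNode *c2 = 0, AstNode *c3 = 0) {
    parent->down = c1;
    if (c1) c1->right = c2;
    if (c2) c2->right = c3;
    return parent;
}

// Render the subtree rooted at t as "( root child_1 ... child_n )";
// a node without children is rendered as its label alone.
std::string toLisp(const AstNode *t) {
    if (!t) return "";
    if (!t->down) return t->label;
    std::string s = "( " + t->label;
    for (const AstNode *c = t->down; c; c = c->right)
        s += " " + toLisp(c);
    return s + " )";
}
```

Note that a node needs exactly two pointers no matter how many children it has, which is what lets one node type serve for both the two-operand “+” tree and the three-child If tree.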
The contents of each AST node may contain a variety of things such as pointers into the sym-
bol table and information about the associated token object. The most important piece of infor-
mation is the token type associated with the input token from which the node was constructed.
It is used during a tree walk to distinguish between trees of identical structure but different
“contents”. For example, the tree

    *
    |
    a --- b

is considered different than

    +
    |
    3 --- 4

due to the differences in their token types (which we normally identify graphically via node
labels). Whether the token type is available as a C struct field or a C++ member function is
irrelevant.
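
The distinction between structure and contents can be stated in code. In this sketch (TypedNode, sameShape, sameTree, and the token constants are names invented for illustration), two trees have the same shape if their child and sibling links line up, but they are the same tree only if the token types also agree node for node.

```cpp
// Hypothetical token types for this sketch.
enum { T_MULT = 1, T_PLUS, T_ID, T_INT };

struct TypedNode {
    int type;           // token type
    TypedNode *down;    // first child
    TypedNode *right;   // next sibling
};

// True if a and b have identical structure, ignoring token types.
bool sameShape(const TypedNode *a, const TypedNode *b) {
    if (!a || !b) return a == b;   // both must be absent together
    return sameShape(a->down, b->down) && sameShape(a->right, b->right);
}

// True if a and b have identical structure and identical token types.
bool sameTree(const TypedNode *a, const TypedNode *b) {
    if (!a || !b) return a == b;
    return a->type == b->type &&
           sameTree(a->down, b->down) &&
           sameTree(a->right, b->right);
}
```

For the two trees discussed above, sameShape reports a match while sameTree does not, because the roots and leaves differ in token type.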
Tree structures with homogeneous nodes as described here are easy to construct whereas trees
consisting of a variety of node types are very difficult to construct or transform automatically
as we demonstrate in the next section.
1.2.2 SORCERER Versus Hand-Coded Tree Walking
The question “Why is SORCERER useful when you can write a tree parser by hand?” is
analogous to asking why you need a parser generator when you can write a text parser by hand. The
answer is the same, although it is not a perfectly fair comparison since IRs are generally
designed to be easier to parse than the corresponding input text. Nonetheless, SORCERER
grammars have the advantage over hand-coded tree parsers because grammars:
1. are much easier and faster to write,
2. are smaller than programs,
3. are more readable as grammars directly specify the IR structure,
4. are more maintainable,
5. automatically detect malformed input trees, and
6. can possibly detect ambiguities/nondeterminisms in your tree description (such as
when two different patterns have the same root token) that might be missed when
writing a tree walker by hand.
We further note that parsing a tree is the same as parsing a text stream except that the tree
parser must match a two-dimensional stream rather than a one-dimensional stream.
Because a variety of techniques are available to perform tree walks and translations, we present
the common C and C++ hand-coding techniques and illustrate why SORCERER grammars
often represent more elegant solutions.
In C, given homogeneous tree structures, there are two possible tree-walking strategies:
1. A simple recursive depth-first search function applies a translation function to each
node in the tree indiscriminately. The translation function would have to test the
node to which it was applied in order to perform the necessary task. Any translation
function would have trouble dealing with multiple-node subtrees such as those
constructed for if-statements. The structure of the IR is not tested for deformities.
2. A hand-built parser explicitly tests for the IR structure such as “if root node is an If
node and the first child is an expression, ...”. SORCERER is simply a shorthand for
this strategy.
Alternatively, you can use C structures with fields that point to the required children rather than
a list of children nodes. We defer the discussion of nonhomogeneous tree nodes to the C++ dis-
cussion below.
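
The two strategies for homogeneous trees can be sketched side by side. The code is written in C-style C++ and is only an illustration: IRNode, visitAll, countNode, matchIf, and the token constants are hypothetical names. The first function visits every node without checking structure; the second tests explicitly for the pattern ( If expr slist slist ), which is the kind of code a tree grammar would replace.

```cpp
// Hypothetical token types for this sketch.
enum { IfTok = 1, ExprTok, SListTok };

struct IRNode {
    int token;        // token type
    IRNode *down;     // first child
    IRNode *right;    // next sibling
};

// Strategy 1: depth-first walk that applies 'action' to every node.
// Nothing here checks that the tree is well formed.
void visitAll(IRNode *t, void (*action)(IRNode *)) {
    for (; t; t = t->right) {
        action(t);
        visitAll(t->down, action);
    }
}

// Example action for strategy 1: count the nodes visited.
static int visited = 0;
void countNode(IRNode *) { ++visited; }

// Strategy 2: explicitly test for an If root with exactly three
// children: an expression followed by two statement lists.
bool matchIf(const IRNode *t) {
    if (!t || t->token != IfTok) return false;
    const IRNode *cond = t->down;
    if (!cond || cond->token != ExprTok) return false;
    const IRNode *thenPart = cond->right;
    if (!thenPart || thenPart->token != SListTok) return false;
    const IRNode *elsePart = thenPart->right;
    return elsePart && elsePart->token == SListTok && !elsePart->right;
}
```

The first strategy happily walks a malformed If tree; the second rejects it, but at the cost of spelling out every structural test by hand.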
In C++, given tree nodes with homogeneous behavior, you could make each node in the tree an
appropriate class that had a walk() member function. The walk() function would do the
appropriate thing depending on what type of tree node it was. However, you would end up
building a complete tree parser by hand. The class member PLUSNode::walk() would be
analogous to a rule plus_expr. For example,
class PLUSNode : public AST {
public:
    void walk()
    {
        MATCH(PLUS);                    // match the root
        this->down()->walk();           // walk left operand
        this->down()->right()->walk();  // walk right operand
    }
    ...
};
versus
plus_expr
: #( PLUS expr expr )
;
where expr would nicely group all the expression templates in one rule. I.e.,
expr
: plus_expr
| mult_expr
...
;
whereas in the hand-coded version, there could be no explicit specification for what an
expression tree looks like--there is just a collection of C++ classes with similar names such as
PLUSNode, MULTNode, and so on:
class PLUSNode : public AST { public: void walk(); ... };
class MULTNode : public AST { public: void walk(); ... };
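
For readers who want to try the hand-coded style, here is a self-contained variant of the classes above. It is a sketch under several assumptions: the AST base class shown is a stand-in written for this example (not the PCCTS AST class), the MATCH macro is replaced by an explicit check, INTNode is an invented leaf class, and walk() takes an output string so that its effect can be observed.

```cpp
#include <stdexcept>
#include <string>

// Stand-in base class for this sketch: child-sibling links plus a
// virtual walk() that appends translated text to 'out'.
class AST {
public:
    AST() : down_(0), right_(0) {}
    virtual ~AST() {}
    virtual void walk(std::string &out) = 0;
    AST *down() const { return down_; }
    AST *right() const { return right_; }
    void setDown(AST *d) { down_ = d; }
    void setRight(AST *r) { right_ = r; }
private:
    AST *down_;
    AST *right_;
};

// Leaf node: emits its digits.
class INTNode : public AST {
public:
    explicit INTNode(const std::string &d) : digits(d) {}
    void walk(std::string &out) { out += digits + " "; }
private:
    std::string digits;
};

// Operator node: checks that two operands exist, walks them in order,
// then emits "+" (postfix output).
class PLUSNode : public AST {
public:
    void walk(std::string &out) {
        if (!down() || !down()->right())
            throw std::runtime_error("malformed PLUS tree");
        down()->walk(out);            // walk left operand
        down()->right()->walk(out);   // walk right operand
        out += "+ ";
    }
};
```

Walking the tree for 3+4 appends "3 4 + " to the output string. Each new operator needs another class of this kind, and nothing in the code says in one place what an expression tree may look like.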
On the other hand, if we used a variety of tree node types, a set of class members could point to
the appropriate information rather than using a generic list of children. For example,
class EXPRNode : public AST {...};
class PLUSNode : public EXPRNode {
EXPRNode *left_opnd;
EXPRNode *right_opnd;
walk()
{
left_opnd->walk();
right_opnd->walk();
}
};
However, a walk() function is still needed to specify what to do and in what order. A set of
member pointers is not nearly as powerful as a grammar because a grammar can specify oper-
and order and sequences of operands. The order of operands is important during translation
when you want to generate an appropriate sequence of output (i.e., what if the field names were
Bill and Ted instead of left_opnd and right_opnd? While these are silly names, the
point is made that you have to encode order in the names of the fields). The ability to specify
sequences is analogous to allowing you to specify structures of varying size. For example, the
tree structure to describe a function call expression would have to be encoded as follows
class FUNCCallNode : public EXPRNode {
char *id;
List<EXPRNode *> arguments;
walk()
{
for ( each element of arguments list )
arg->walk();
}
};
Because the number of arguments is unknown at compile-time, a list of arguments must be
maintained and walked by hand whereas, using SORCERER, since everything is represented
as a generic list of nodes, you could easily describe such IR structures:
func_call
: #( FUNCCall ID ( expr )* )
;
No matter how fancy you get using C or C++, you must still describe your IR structure at least
partially with hand-written code rather than with a grammar.
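To make the comparison concrete, here is a minimal sketch of what the one-line func_call rule above replaces when written by hand over a homogeneous child-sibling tree. The Node structure, the token type names, and the functions walkExpr() and walkFuncCall() are hypothetical names chosen for illustration; this is not code produced by SORCERER, and the expr rule is assumed to be just ID or INT.

```cpp
#include <cassert>
#include <string>

// Hypothetical token types; a real translator would define its own.
enum TokenType { FUNCCall, ID, INT };

// Homogeneous child-sibling node: 'down' is the first child,
// 'right' is the next sibling.
struct Node {
    TokenType type;
    std::string text;
    Node *down;
    Node *right;
};

// Hand-coded analogue of:  expr : ID | INT ;   (assumed for this sketch)
bool walkExpr(const Node *t, std::string &out) {
    if (t == nullptr || (t->type != ID && t->type != INT)) return false;
    out += t->text;
    return true;
}

// Hand-coded analogue of:  func_call : #( FUNCCall ID ( expr )* ) ;
// Returns false when the tree does not have the expected structure.
bool walkFuncCall(const Node *t, std::string &out) {
    if (t == nullptr || t->type != FUNCCall) return false;    // match the root
    const Node *child = t->down;
    if (child == nullptr || child->type != ID) return false;  // match ID
    out += child->text + "(";
    bool first = true;
    for (child = child->right; child != nullptr; child = child->right) {
        if (!first) out += ",";                                // ( expr )*
        if (!walkExpr(child, out)) return false;
        first = false;
    }
    out += ")";
    return true;
}
```

Every structural test that the grammar states implicitly (root type, first child, zero or more remaining children) has to be spelled out and kept in sync by hand in this version.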
The only remaining reason to have a variety of node class types is to specify what translation
action to execute for each node. Action execution too is better handled grammatically. Actions
embedded within the SORCERER grammar dictate what to do and, according to action posi-
tion, when to do it. In this manner, it is very obvious what actions are executed during the tree
walk. With the hand-coded C++ approach, you would have to peruse the class hierarchy to dis-
cover what would happen at each node. Further, there may be cases where you have two iden-
tical nodes in an IR structure for which you would like to perform two different actions
depending on their context. With the grammatical approach, you would simply place a differ-
ent action at each node reference. The hand-coded approach would force you to make action
member functions sensitive to their surrounding tree context, which is difficult and cumber-
some.
We have shown in this section that a grammar is more appropriate for describing the structure
of an IR than a hand-coded C function or C++ class hierarchy with a set of walk() member
functions and that child-sibling trees are a very convenient tree-structure implementation. Trans-
lations can also be performed more easily by embedding actions within a grammar rather than
scattering the actions around a class hierarchy. On the other hand, we do not stop you from
walking trees with a variety of class types--SORCERER will pretend, however, that your tree
consists only of nodes of a single type.
1.2.3 What Is SORCERER Good At and Bad At?
SORCERER is not the “Silver Bullet” of translation. It was designed specifically to support
source-to-source translations via a set of embedded actions that generate output directly or via
a set of tree transformation actions.
SORCERER is good at
• Describing tree structures (just as LISP is good at it).
• Syntax-directed translations.
• Tree transformations either local such as constant folding and tree normalizations or
global such as adding declarations for implicitly defined variables.
• Interpreting trees such as for scripting languages.
SORCERER is not good at or does not support
• Optimized assembly code generation.
• Construction of “use-def” chains, data-flow dependency graphs, and other common
compiler data structures; although, SORCERER can be used to traverse statement lists
to construct these data structures with user-supplied actions.
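As an illustration of the local tree transformations listed above, the following sketch folds constant subtrees such as #( PLUS INT INT ) into a single INT node. It is written by hand in plain C++ so that it stands alone; the Node layout, the token types, and the function fold() are assumptions made for this example and are not part of SORCERER.

```cpp
#include <cassert>

// Hypothetical token types for a tiny expression language.
enum TokenType { PLUS, MULT, INT };

// Child-sibling node; 'value' is meaningful only for INT nodes.
struct Node {
    TokenType type;
    int value;
    Node *down;   // first child
    Node *right;  // next sibling
};

// Fold #( PLUS INT INT ) and #( MULT INT INT ) into INT nodes, bottom-up.
// The operator node is rewritten in place, so its sibling link is preserved.
void fold(Node *t) {
    if (t == nullptr || t->type == INT) return;
    Node *left = t->down;
    assert(left != nullptr && left->right != nullptr);  // binary operators only
    Node *rightOpnd = left->right;
    fold(left);
    fold(rightOpnd);
    if (left->type == INT && rightOpnd->type == INT) {
        t->value = (t->type == PLUS) ? left->value + rightOpnd->value
                                     : left->value * rightOpnd->value;
        t->type = INT;       // the operator becomes a constant
        t->down = nullptr;   // detach the operands; freeing them is left to the caller
    }
}
```

Applied to the tree #( PLUS 1 #( MULT 2 3 ) ), the inner MULT subtree is folded first and the root then becomes a single INT node.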
1.2.4 How Is SORCERER Different Than a Code-Generator Generator?
Compiler code-generator generators are designed to produce a stream of assembly language
instructions from an input tree representing the statements in a source program. Because there
may be multiple assembly instructions for a single tree pattern (e.g., integer addition and incre-
ment), a code-generator generator must handle ambiguous grammars. An ambiguous grammar
is one for which an input sequence may be recognized in more than one way by the resulting
parser (i.e., there is more than one grammatical derivation). A “cost” is provided for each tree
pattern in the grammar to indicate how “expensive” the instruction issued by the pattern would
be in terms of execution speed or memory space. The code-generator finds the optimal walk of
the input tree, which results in the optimal assembly instruction stream. SORCERER differs in
the following ways:
1. Because code-generators must choose between competing grammar alternatives,
they must match the entire alternative before executing a translation action. How-
ever, the ability to execute a translation action at any point during the parse is indis-
pensable for source-to-source translation.
2. Code-generator generators were not designed for and cannot perform tree rewrites.
3. Code-generator generators normally do not allow EBNF grammar constructs to
specify lists of elements, etc.
4. While code-generator generators handle unambiguous grammars like SORCERER’s
as well as ambiguous grammars, they may not handle unambiguous grammars as
efficiently as a tool specifically tuned for fast deterministic parsing.
It is ironic that most translator generators are code-generator generators, even though most
translation problems do not involve compilation. Unfortunately, few practical tools like SOR-
CERER exist for the larger scope of source-to-source translation.
The reference guide for SORCERER begins on page 83.
CHAPTER 2 ANTLR Reference
This chapter describes how to construct parsers via ANTLR grammars, how to interface a
parser to your application, and how to insert actions to generate output. Unless otherwise spec-
ified, actions and other source code are C++.
[Professors Russell Quong, Hank Dietz, and Will Cohen all have contributed greatly to the
overall development of PCCTS in general. In particular, much of the intellectual property of
ANTLR was conceived with Russell Quong.]
2.1 ANTLR Descriptions
Generally speaking, an ANTLR description consists of a collection of lexical and syntactic
rules, which describe the language to be recognized, and a collection of user-defined semantic
actions which describe what to do with the input sentences as they are recognized. A single
grammar may be broken up into multiple files and multiple grammars may be specified within
a single file, but the basic sequence follows something like:
header action
actions
token definitions
rules
actions
token definitions
For example, the following is a complete ANTLR description that recognizes the vocabulary of
B. Simpson:
<<
typedef ANTLRCommonToken ANTLRToken;
#include "DLGLexer.h"
main() {
ParserBlackBox<DLGLexer, // create a parser
BSimpsonParser,
ANTLRToken> bart(stdin);
bart.parser()->a(); // invoke parser
}
>>
#token "[\ \t\n]+" <<skip();>> // ignore whitespace
#token MAN "man"
class BSimpsonParser {
a : "no" "way" MAN
| "don't" "have" "a" "cow" "man"
;
}
More precisely, ANTLR descriptions conform to the following grammar:
grammar
: ( "#header" ACTION
| "#parser" STRING
| "#tokdefs" STRING
)*
{ "class" ID "\{" }
( ACTION | lexaction | directive | global_exception_handler )*
( rule | directive )+
( ACTION | directive )*
{ "\}" }
( ACTION | directive )*
;
directive
: lexclass | token_def | errclass_def | tokclass_def
;
where the following lexical items apply:
TABLE 1. Lexical Items in an ANTLR Description
Token Name   Form      Example
ACTION       <<...>>   <<int i;>>
                       <<define(id->getText());>>
STRING       "..."     "[a-z]+" "begin"
                       "includefile.h" "test"
There is no start rule specification per se as any rule can be invoked first.
2.1.1 Comments
Both C and C++ style comments are allowed within the grammar (outside of actions) regard-
less of the language used within actions. For example,
/* here is a rule */
args : ID ( "," ID )* ; // match a list of ID’s
The comment style used within your actions is determined, naturally, by the language you are pro-
gramming in.
2.1.2 #header Directive
Any C or C++ code that must be visible to files generated by ANTLR must be placed in an action
at the start of your description preceded by the #header directive. This directive is necessary
when using the C interface and optional with the C++ interface. Turn on ANTLR command
line option -gh when using the C interface if the function that invokes the parser is in a non-
ANTLR-generated file. See Section 2.8 on page 71.
2.1.3 #parser Directive
Because C does not have the notion of a package or module, linking more than one ANTLR-
generated parser together causes multiply-defined symbol errors (due to the global variables
defined in each parser). The solution to the problem is to prefix all global ANTLR symbols
with a user-defined string in order to make the symbols unique to the C linker. The #parser
directive is used to specify this prefix. A file called remap.h is generated that contains a sequence of
redefinitions for the global symbols. For example,
#parser "foo"
would generate a remap.h file similar to:
#define your_rule foo_your_rule
#define zztokenLA foo_zztokenLA
#define AST foo_AST
...
TABLE 1. Lexical Items in an ANTLR Description (continued)
Token Name   Form                 Example
TOKEN        [A-Z][a-z0-9_]+      ID KeyBegin IntForm1
RULE         [a-z][a-z0-9_]+      expr statement func_def
ARGBLOCK     [...]                [34] [int i, float j]
ID           [a-zA-Z][a-z0-9_]+   CParser label
SEMPRED      <<...>>?             <<isType(id->getText())>>?
2.1.4 Parser Classes
When using the C++ interface, you must specify the name of the parser class by enclosing all
rules in
class Parser {
...
}
A parser class results in a subclass of ANTLRParser in the resulting parser. A parser object is
simply a set of actions and routines for recognizing and performing operations on sentences of
a language. Consequently, it is natural to have many separate parser objects; e.g., when recog-
nizing include files.
Currently, exactly one parser class may be defined. For the defined class, ANTLR generates a
derived class of ANTLRParser.
Actions may be placed within the parser class scope and may contain any C++ code that is
valid within a C++ class definition--any variable or function declarations will become mem-
bers of the class in the resulting C++ output. For example,
class Parser {
<<public: int i;>>
<<int f() { blah; }>>
rule : A B <<f();>> ;
}
Here, variable i and function f are members of class Parser which would become a sub-
class of ANTLRParser in the resulting C++ code.
The actions at the head of the parser class are collected and placed near the head of the result-
ing C++ class; the actions at the end of the parser class are similarly collected and placed near
the end of the resulting parser class definition.
The Parser.h file generated by ANTLR for this parser class would resemble:
class Parser : public ANTLRParser {
protected:
static ANTLRChar *_token_tbl[];
public: int i;
int f() { blah; }
static SetWordType setwd1[4];
public:
Parser(ANTLRTokenBuffer *input);
Parser(ANTLRTokenBuffer *input, TokenType eof);
void rule(void);
};
2.1.5 Rules
An ANTLR rule describes a portion of the input language and consists of a list of alternatives;
rules also contain code for error handling and argument or return value passing. The exact syn-
tax of a rule is the following:
rule
: RULE { "!" } { ARGBLOCK }
{ ">" ARGBLOCK }
{ STRING } ":"
block ";"
{ ACTION }
( exception_group )*
;
block
: alt ( exception_group )* ( "\|" alt ( exception_group )* )*
;
alt : { "\@" } ( { "\~" } element )*
;
token: TOKEN | STRING
;
element
: { ID ":" }
( token { ".." token } { "^" | "!" } { "\@" }
| "." { "^" | "!" }
| RULE { "!" } { ARGBLOCK } { ">" ARGBLOCK }
)
| ACTION
| SEMPRED
| "\(" block "\)" { "\*" | "\+" | "?" }
| "\{" block "\}"
;
Therefore, a rule looks like:
rule : alternative₁
     | alternative₂
     ...
     | alternativeₙ
     ;
where each alternative production is composed of a list of elements which can be references to
rules, references to tokens, actions, predicates, and subrules. Argument and return value defini-
tions look like the following where there are n arguments and m return values.
rule[arg₁,...,argₙ] > [retval₁,...,retvalₘ] : ... ;
The syntax for using a rule mirrors its definition,
a : ... rule[arg₁,...,argₙ] > [v₁,...,vₘ] ...
  ;
Here, the various vᵢ receive the return values from the rule rule, so that each vᵢ must be an
l-value.
If the first element of the rule is an action, that action is an init-action and is executed once
before recognition of the rule begins and is the place to define local variables.
2.1.6 Subrules (EBNF Descriptions)
A subrule is the same as a rule without a label and, hence, has no arguments or return values.
There are four kinds of subrule to choose from; see Table 2.
If the first element of the whole subrule is an action, that action is an init-action and is executed
once before recognition of the subrule begins--even if the subrule is a looping construct. Fur-
ther, the action is executed always even if the subrule matches nothing.
Put in pictures of syntax diagrams...
2.1.7 Rule Elements
2.1.7.1 Actions
Actions are of the form <<...>> and contain programmer-supplied C or C++ code that must
be executed during the parse. Init-actions are actions that are the very first element of a rule or
subrule; they are executed before the rule or subrule recognizes anything and can be used to
define local variables. Fail-actions are placed after the ‘;’ in a rule definition and are executed
if an error occurred while parsing the rule (unless exception handlers are used). Any other
action is executed immediately after the preceding rule element and before any following ele-
ments.
2.1.7.2 Semantic Predicate
A semantic predicate has two forms:
• <<...>>? This form represents a C or C++ expression that must evaluate to true
before the elements beyond it in the rule are authorized for recognition.
• ( lookahead-context )? => <<...>>? This form is simply a more specific
form as it indicates that the predicate is only valid under a particular lookahead context;
e.g., the following predicate indicates that the isTypeName() test is only valid if the
first symbol of lookahead is an identifier.
( ID )? => <<isTypeName(LT(1)->getText())>>?
Typically, semantic predicates are used to specify the semantic validity of a particular produc-
tion and, therefore, most often are placed at the extreme left edge of productions.
TABLE 2. ANTLR Subrule Format
Name            Form     Example
plain subrule   (...)    (ID | INT)
zero-or-more    (...)*   ID ( "," ID )*
one-or-more     (...)+   ( declaration )+
optional        {...}    { "else" statement }
You should normally allow ANTLR to compute the lookahead context (ANTLR command line
option “-prc on”). See “Predicates” on page 56.
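The effect of a semantic predicate placed at the left edge of a production can be pictured as an extra test in the parser's decision. The sketch below is a hand-written analogue, not ANTLR output: MiniParser, its typeNames table, and the rule name stat are hypothetical, and the grammar in the comment is only an assumed example of the form described above.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

enum TokenType { ID, Eof };
struct Token { TokenType type; std::string text; };

struct MiniParser {
    std::vector<Token> input;           // must end with an Eof token
    std::set<std::string> typeNames;    // hypothetical symbol table
    std::size_t pos = 0;

    // i-th token of lookahead, counting from 1
    const Token &LT(int i) const { return input[pos + i - 1]; }
    bool isTypeName(const std::string &s) const { return typeNames.count(s) != 0; }

    // Analogue of the assumed rule
    //   stat : ( ID )? => <<isTypeName(LT(1)->getText())>>? ID ID   // declaration
    //        | ID                                                   // expression
    //        ;
    std::string stat() {
        // The predicate is consulted only in its lookahead context (an ID).
        if (LT(1).type == ID && isTypeName(LT(1).text)) {
            pos += 2;
            return "declaration";
        }
        if (LT(1).type == ID) {
            pos += 1;
            return "expression";
        }
        return "error";
    }
};
```

Both alternatives begin with ID, so the lookahead alone cannot choose between them; the predicate supplies the missing semantic information.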
2.1.7.3 Syntactic Predicate
Syntactic predicates are of the form (...)? and specify the syntactic context under which a pro-
duction will successfully match. They are useful in situations where normal LL(k) parsing is
inadequate. For example,
a : ( list "=" )? list "=" list
| list
;
See “Predicates” on page 56.
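One way to picture a syntactic predicate is as a speculative match followed by a rewind. The sketch below hand-codes rule a from the example above under the assumption that list is defined as ID ( "," ID )*; MiniParser and its members are hypothetical names, and the parser ANTLR actually generates is organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum TokenType { ID, COMMA, ASSIGN, Eof };

struct MiniParser {
    std::vector<TokenType> input;   // must end with Eof
    std::size_t pos = 0;

    bool match(TokenType t) {
        if (input[pos] != t) return false;
        ++pos;
        return true;
    }

    // list : ID ( "," ID )* ;   (assumed definition of 'list')
    bool list() {
        if (!match(ID)) return false;
        while (input[pos] == COMMA) {
            ++pos;
            if (!match(ID)) return false;
        }
        return true;
    }

    // a : ( list "=" )? list "=" list
    //   | list
    //   ;
    // Returns the number of the alternative taken, or 0 on a syntax error.
    int a() {
        std::size_t mark = pos;                     // remember the starting point
        bool predicted = list() && match(ASSIGN);   // try the predicate speculatively
        pos = mark;                                 // rewind whether or not it matched
        if (predicted) {
            return (list() && match(ASSIGN) && list()) ? 1 : 0;
        }
        return list() ? 2 : 0;
    }
};
```

Because a list can be arbitrarily long, no fixed k tokens of lookahead can see the "=" that distinguishes the two alternatives, which is why the speculative match is needed.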
2.1.7.4 Tokens, Token Classes and Token Operators
Token references indicate the token that must be matched on the input stream and are either
identifiers beginning with an uppercase letter or are regular expressions enclosed in double
quotes. A token class looks just like a token reference, but has an associated #tokclass def-
inition and indicates the set of tokens that can be matched on the input stream.
The range operator has the form T₁ .. Tₙ and specifies a token class containing the set of
tokens from T₁ up to Tₙ inclusively.
The not operator has the form ~T and specifies the set of all tokens defined in the grammar
except for T.
See “Token Classes” on page 33.
2.1.7.5 Rule References
Rule references indicate that another rule should be invoked to recognize part of the input
stream. The rule may be passed some arguments and may return some values. Rules are iden-
tifiers that begin with a lowercase letter. For example,
a : <<int i;>> b[34] > [i]
;
b[int j] > [int k]
: A B <<$k = $j + 1;>> //return argument + 1
;
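The flow of values in this example can be read as an ordinary function call. The following is a hand-written analogue in plain C++, given only to show where the argument and return value travel; token matching for A and B is omitted, and the code ANTLR generates for these rules will differ.

```cpp
#include <cassert>

// Analogue of:  b[int j] > [int k] : A B <<$k = $j + 1;>> ;
int b(int j) {
    int k;
    k = j + 1;    // the action <<$k = $j + 1;>>
    return k;     // the return value [int k]
}

// Analogue of:  a : <<int i;>> b[34] > [i] ;
int a() {
    int i;        // the init-action <<int i;>> defines a local variable
    i = b(34);    // i receives the return value, so it must be an l-value
    return i;
}
```

Under this reading, invoking a() passes 34 to b and leaves 35 in i.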
2.1.7.6 Labels
All rule, token, and token class references may be labeled with an identifier. Labels are
mainly used to access the attribute (C interface) or token object (C++ interface) of tokens. Rule
labels are used primarily by the exception handling mechanism to make a group of handlers
specific to a rule invocation.
Labels may begin with either an upper or lowercase letter; e.g., id:ID ER:expr.
Actions in an ANTLR grammar may access attributes via labels of the form $label attached
to tokens rather than the conventional $i for some integer i. By using symbols instead of inte-
ger identifiers, grammars are more readable and action parameters are not sensitive to changes
in rule element positions. The form of a label is:
label:element
where element is either a token reference or a rule reference. To refer to the attribute (C
interface) or token pointer (C++ interface) of that element in an action, use
$label
within an action or rule argument list. For example,
a : t:ID <<printf("%s\n", $t->getText());>>
;
using the C++ interface. To reference the tree variable associated with element, use
#label
When using parser exception handling, simply reference label to attach a handler to a partic-
ular rule reference. For example,
a : t:b
exception[t]
default : <<trap any error found during call to 'b'>>
;
Labels must be unique per rule as they have rule scope. Labels may be accessed from parser
exception handlers.
See “Attributes Using the C Interface” on page 77, “Accessing Token Objects From Grammar
Actions” on page 41, and “Parser Exception Handlers” on page 36.
2.1.7.7 AST Operators
When constructing ASTs, ANTLR assumes that any nonsuffixed token is a leaf node in the
resulting tree. To inform ANTLR that a particular token should not be included in the output
AST, suffix the token with ‘!’. Rules may also be suffixed with ‘!’ to indicate that the tree
constructed by the invoked rule should not be linked into the tree constructed for the current
rule. Any token suffixed with the ‘^’ operator is considered a root token--a tree node is con-
structed for that token and is made the root of whatever portion of the tree has been built; e.g.,
a : A B^ C^ ;
results in the following tree:
C
|
B
|
A
First A is matched and made a lonely child, followed by B which is made the parent of the cur-
rent tree--A. Finally, C is matched and made the parent of the current tree--making it the parent
of the B node.
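The construction order just described can be sketched with a small child-sibling tree builder. The names Node, TreeBuilder, leaf() and makeRoot() are hypothetical, and the bookkeeping shown here is one plausible way to obtain the behavior described in the text, not the routine ANTLR itself uses.

```cpp
#include <cassert>
#include <string>

struct Node {
    std::string text;
    Node *down = nullptr;    // first child
    Node *right = nullptr;   // next sibling
    explicit Node(const std::string &t) : text(t) {}
};

struct TreeBuilder {
    Node *root = nullptr;    // set once a token suffixed with ^ has been seen
    Node *first = nullptr;   // first node of the current sibling list
    Node *tail = nullptr;    // last node of the current sibling list

    // Unsuffixed token: append to the current sibling list.
    void leaf(Node *n) {
        if (tail != nullptr) {
            tail->right = n;
        } else {
            first = n;
            if (root != nullptr) root->down = n;
        }
        tail = n;
    }

    // Token suffixed with ^: becomes the parent of whatever has been built.
    void makeRoot(Node *n) {
        if (root != nullptr) {
            n->down = root;          // the old tree becomes the only child
            first = tail = root;
        } else {
            n->down = first;         // the sibling list becomes the children
        }
        root = n;
    }

    Node *result() const { return root != nullptr ? root : first; }
};
```

Calling leaf(A), makeRoot(B), makeRoot(C) in that order mirrors the rule a : A B^ C^ ; and yields C at the root with B below it and A below B.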
2.1.7.8 Exception Operator
See “The @ Operator” on page 71.
2.1.8 Multiple ANTLR Description Files
ANTLR descriptions may be broken up into many different files, but the
sequence mentioned above in grammar must be maintained.
For example, if file f1.g contained
#header <<#include "int.h">>
<< main() { ANTLR(start(), stdin); } >>
and file f2.g contained
start : "begin" VAR "=" NUM ";" "end" "." "@" ;
and file f3.g contained
#token VAR "[a-z]+"
#token NUM "[0-9]+"
the correct ANTLR invocation would be
antlr f1.g f2.g f3.g
Note that the order of files f2.g and f3.g could be switched. In this case, to comply with
ANTLR's description meta-language, the only restriction is that file f1.g must be mentioned
first on the command line.
Other files may be included into the parser files generated by ANTLR via actions containing a
#include directive. For example,
<<#include "support_code.h">>
If a file (or anything else) must be included in all parser files generated by ANTLR, the
#include directive must be placed in the #header action. In other words,
#header <<#include "necessary_type_defs_for_all_files.h">>
Note that #include's can be used to define any ANTLR object (Attrib, AST, etc...) by
placing it in the #header action.
2.2 Lexical Directives
2.2.1 Token Definitions
Tokens are defined either explicitly with #token or implicitly by using them as rule elements.
Implicitly defined tokens can be either regular expressions (non-identified tokens) or token
names (identified). Token names begin with an upper-case letter (rules begin with a lower-case
letter). More than one occurrence of the same regular expression in a grammar description
produces a single regular expression in the lexical description passed to DLG (parser.dlg) and
is assigned one token type number. Regular expressions and token identifiers that refer to the
same lexical object (input pattern) may be used interchangeably. Token identifiers that are ref-
erenced, but not attached to a regular expression are simply assigned a token type and result in
a #define definition only. It is not necessary to label any regular expressions with an identi-
fier in ANTLR. However, all token types that you wish to explicitly refer to in an action must
be declared with a #token instruction.
The user may introduce tokens, lexical actions, and token identifiers via the #token directive.
Specifically,
• Simply declare a token for use in a user action:
#token VAR
• Associate a token with a regular expression and, optionally, an action:
#token ID "[a-zA-Z][a-zA-Z0-9]*"
#token Eof "@" << printf("Eof Found\n"); >>
• Specify what must occur when a regular expression is matched:
#token "[0-9]+" <<printf("Found an int\n");>>
Lexical actions tied to a token definition may access the following variables, functions, and
macros:
TABLE 3. C++ Interface Symbols Available to Lexical Actions
Symbol                            Description
replchar(DLGchar c)               Replace the text of the most recently matched
                                  lexical object with c. You can erase the current
                                  expression text by sending in a ‘\0’.
replstr(DLGchar *s)               Replace the text of the most recently matched
                                  lexical object with s.
int line()                        The current line number being scanned by DLG.
newline()                         Maintain DLGLexer::_line by calling this
                                  function when a newline character is seen; just
                                  increments _line.
more()                            This function merely sets a flag that tells DLG to
                                  continue looking for another token; future
                                  characters are appended to the current token text.
skip()                            This function merely sets a flag that tells DLG to
                                  continue looking for another token; future
                                  characters are not appended to the current token
                                  text.
advance()                         Instruct DLG to consume another input character.
                                  ch will be set to this next character.
int ch                            The most recently scanned character.
DLGchar *lextext()                The entire lexical buffer containing all characters
                                  matched thus far since the last token type was
                                  returned. See more() and skip().
DLGchar *begexpr()                Beginning of last token matched.
DLGchar *endexpr()                Pointer to the last character of last token
                                  matched.
trackColumns()                    Call this function to get DLG to track the column
                                  numbers.
int begcol()                      The column number starting from 1 of the first
                                  character of the most recently matched token.
int endcol()                      The column number starting from 1 of the last
                                  character of the most recently matched token.
                                  Reset the column to 0 when a newline is
                                  encountered. Adjust the column also in the lexical
                                  action when a character is not one print position
                                  wide (e.g. tabs or non-printing characters). The
                                  column information is not immediately updated if a
                                  token's action calls more().
set_begcol(int a)                 Set the current token column number for the
                                  beginning of the token.
set_endcol(int a)                 Set the current token column number for the
                                  end of the token.
DLGchar                           The type name of a character read by DLG. This
                                  is typedef’d to char by default, but it could
                                  be a class or another atomic type.
errstd(char *)                    Called automatically by DLG to print an error
                                  message indicating that the input text matches no
                                  defined lexical expressions. Override in a
                                  subclass to redefine.
mode(int m)                       Set the lexical mode (i.e., lexical class or
                                  automaton) corresponding to a lex class defined in
                                  an ANTLR grammar with the #lexclass directive.
                                  Yes, this is very poorly named.
setInputStream(DLGInputStream *)  Specify that the scanner should read characters
                                  from the indicated input stream (e.g., file, string,
                                  function).
saveState(DLGState *)             Save the current state of the scanner. You do not
                                  really need this function.
restoreState(DLGState *)          Restore the state of the scanner from a state
                                  buffer.
2.2.2 Regular Expressions
The input character stream is broken up into vocabulary symbols (tokens) via regular expres-
sions--a meta-language similar to the ANTLR EBNF description language. ANTLR collects
all of the regular expressions found within your grammar (both those defined implicitly within
the grammar and those defined explicitly via the #token directive) and places them in a file
that is converted to a scanner by DLG. Table 4 on page 32 describes the set of regular expres-
sions available to you.
2.2.3 Token Order and Lexical Ambiguities
The order in which regular expressions are found in the grammar description file(s) is signifi-
cant. When the input stream contains a sequence of characters that match more than one regular
expression (e.g., one regular expression is a subset of another), the scanner is confronted
with a dilemma. The scanner does not know which regular expression to match, hence, it does
not know which action should be performed. To resolve the ambiguity, DLG (the scanner gen-
erator) assumes that the regular expression which is defined earliest in the grammar should
take precedence over later definitions. Therefore, tokens that are special cases of other regular
expressions should be defined before the more general regular expressions. For example, a
keyword is a special case of a variable and thus needs to occur before the variable definition.
#token KeywordBegin "begin"
...
#token ID "[a-zA-Z][a-zA-Z0-9]*"
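The tie-breaking rule can be modeled in a few lines. The sketch below does not reproduce how DLG works internally (DLG builds an automaton); it only mimics the stated policy that, when a lexeme matches several expressions, the one defined earliest wins. The function classify() and the use of std::regex are choices made for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Each rule is a (token name, regular expression) pair, kept in definition order.
typedef std::vector<std::pair<std::string, std::string> > RuleList;

// Return the name of the first rule, in definition order, whose pattern
// matches the entire lexeme.
std::string classify(const RuleList &rules, const std::string &lexeme) {
    for (const auto &rule : rules) {
        if (std::regex_match(lexeme, std::regex(rule.second))) {
            return rule.first;
        }
    }
    return "<no match>";
}
```

With KeywordBegin listed before ID, the lexeme begin is classified as KeywordBegin while beginning is still an ID; with the two definitions swapped, begin would be classified as ID and the keyword could never be recognized.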
TABLE 4. Regular Expression Syntax
Expression   Description
a|b          Matches either the pattern a or the pattern b.
(a)          Matches the pattern a. Pattern a is kept as an indivisible unit.
{a}          Matches a or nothing, i.e., the same as (a|).
[a]          Matches any single character in character list a; e.g., [abc]
             matches either an a, b or c and is equivalent to (a|b|c).
[a-b]        Matches any of the single characters whose ASCII codes are between
             a and b inclusively, i.e., the same as (a|...|b).
~[a]         Matches any single character except for those in character list a.
~[]          Matches any single character; literally “not nothing.”
a*           Matches zero or more occurrences of pattern a.
a+           Matches one or more occurrences of pattern a, i.e., the same as
             aa*.
@            Matches end-of-file.
\t           Tab character.
\n           Newline character.
\r           Carriage return character.
\b           Backspace character.
\a           Matches the single character a--even if a by itself would have a
             different meaning, e.g., \+ would match the + character.
\0nnn        Matches character that has octal value nnn.
\0xnn        Matches character that has hexadecimal value nn.
\mnn         Matches character with decimal value mnn, 1≤m≤9.
2.2.4 Token Definition Files (#tokdefs)
It is often the case that the user is interested in specifying the token types rather than having
ANTLR generate its own; typically, this situation arises when the user wants to link an
ANTLR-generated parser with a non-DLG-based scanner (perhaps an existing scanner). To
get ANTLR to use pre-assigned token types, specify
#tokdefs "mytokens.h"
before any token definitions where mytokens.h is a file with only a list of #defines or an
enum definition with optional comments.
When this directive is used, new token identifier definitions will not be allowed (either explicit
definitions like “#token A” or implicit definitions such as a reference to a token label in a
rule). However, you may attach regular expressions and lexical actions to the token labels
defined in mytokens.h. For example, if mytokens.h contained:
#define A 2
and t.g contained:
#tokdefs "mytokens.h"
#token A "blah"
a : A B;
ANTLR would report the following error message:
Antlr parser generator Version 1.32 1989-1995
t.g, line 3: error: implicit token definition not allowed with #tokdefs
This refers to the fact that token identifier B was not defined in mytokens.h and ANTLR has
no idea how to assign the token identifier a token type number.
Only one token definition file is allowed.
As is common in C and C++ programming, “gates” are used to prevent multiple inclusions of
include files. ANTLR knows to ignore the following two lines at the head of a token definition
file:
#ifndef id₁
#define id₂
No check is made to ensure that id₁ and id₂ are the same or that they conform to any particu-
lar naming convention (such as the name of the file suffixed with “_H”).
The following items are ignored inside your token definition file: whitespace, C style com-
ments, C++ style comments, #ifdef, #if, #else, #endif, #undef, #import. Any-
thing other than these ignored symbols, #define, #ifndef, or a valid enum statement
yields lexical errors.
2.2.5 Token Classes
A token class is a set of tokens that can be referenced as one entity; token classes are equivalent
to subrules consisting of the member tokens separated by “|”s. The basic syntax is:
#tokclass Tclass { T₁ ... Tₙ }
where Tclass is a valid token identifier (begins with an uppercase letter) and Tᵢ is a token ref-
erence (either a token identifier or a regular expression in double-quotes) or a token class refer-
ence; token classes may have overlapping tokens. Referencing Tclass is the same as
referencing a rule of the form
tclass : T₁ | ... | Tₙ ;
The difference between a token class and a rule lies in efficiency. A reference to a token class is
a simple set membership test during parser execution rather than a linear search of the tokens
in a rule (or subrule). Furthermore, the set membership test will be much smaller than a series of
if-statements in a recursive-descent parser. Note that automaton-based parsers (both LL and
LALR) automatically perform this type of set membership (specifically, a table lookup), but
lack the flexibility of recursive-descent parsers such as those constructed by ANTLR.
A predefined wildcard token class, identified by a dot, is available to represent the set of all
defined tokens. For example,
ig : "ignore_next_token" . ;
The wildcard is sometimes useful for ignoring portions of the input stream; however, lexical
classes are often more efficient at ignoring input. The wildcard can also be used for error handling
as an “else-alternative”.
if : "if" expr "then" stat
| . <<fprintf(stderr, "malformed if-statement");>>
;
You should be careful not to do things like this:
ig : "begin"
( . )*
"end"
;
because the loop generated for the “(.)*” block will never terminate--“end” is also matched
by the wildcard. Rather than using the wildcard to match large token classes, it is often best to
use the not operator. For example,
ig : "begin"
( ~"end" )*
"end"
;
where “~” is the not operator and implies a token class containing all tokens defined in the
grammar except the token (or tokens in a token class) modified by the operator. The if exam-
ple could be rewritten as:
if : "if" expr "then" stat
| ~"if" <<fprintf(stderr, "malformed if-statement");>>
;
The not operator may be applied to token class references and token references only (it may
not be applied to subrules, for example). The wildcard operator and the not operator never
result in a set containing the end-of-file token type.
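The set semantics of the wildcard and the not operator can be sketched as follows. This is an illustration only: the token types and the helper names wildcardSet, notSet, and single are hypothetical, and the sketch says nothing about how ANTLR stores its sets internally.

```cpp
#include <bitset>
#include <cassert>

// Hypothetical token types for a tiny grammar.
enum TokenType { Eof = 0, Begin = 1, End = 2, Id = 3, Int = 4, NumTokens = 5 };
using TokenSet = std::bitset<NumTokens>;

// The wildcard ".": every defined token except end-of-file.
inline TokenSet wildcardSet() {
    TokenSet s;
    s.set();        // start with all token types
    s.reset(Eof);   // end-of-file is never part of the wildcard
    return s;
}

// The not operator "~X": every defined token not in X, again without end-of-file.
inline TokenSet notSet(const TokenSet &x) {
    return wildcardSet() & ~x;
}

// Build a one-element set for a single token reference.
inline TokenSet single(TokenType t) {
    assert(t >= 0 && t < NumTokens);
    TokenSet s;
    s.set(t);
    return s;
}
```

With these definitions, notSet(single(End)) contains Begin, Id, and Int, but neither End nor Eof, which is why ( ~"end" )* stops at "end" while ( . )* does not.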
Token classes can also be created via the range operator of the form T1 .. Tn. The token type
of T1 must be less than that of Tn, and the values in between should be valid token types. In
general, this feature should be used in conjunction with #tokdefs so that you control the token
type values. An example range operator is:
#tokdefs "mytokens.h"
a : operand OpStart .. OpEnd operand ;
where mytokens.h contains
#define Add 1
#define Sub 2
#define Mul 3
#define OpStart 1
#define OpEnd 3
This feature is perhaps unneeded due to the more powerful token class directive:
#tokclass Op { Add Sub Mul }
a : operand Op operand ;
2.2.6 Lexical Classes
ANTLR parsers employ DFAs (deterministic finite automata) created by DLG to match
tokens found on the character input stream. More than one automaton (lexical class) may be
defined in PCCTS. Multiple scanners are useful in two ways. First, more than one grammar
can be described within the same PCCTS input file(s). Second, multiple automata can be
used to recognize tokens that seriously conflict with other regular expressions within the same
lexical analyzer (e.g., comments and quoted strings).
Actions attached to regular expressions (which are executed when that expression has been
matched on the input stream) may switch from one lexical analyzer to another. Each analyzer
(lex class) has a label which is used to enter that automaton. A predefined lexical class called
START is in effect from the beginning of the PCCTS description until the user issues a
#lexclass directive or the end of the description is found.
When more than one lexical class is defined, it is possible to have the same regular expression
and the same token label defined in multiple automatons. Regular expressions found in more
than one automaton are given different token numbers, but token labels are unique across lexi-
cal class boundaries. For instance,
#lexclass A
#token LABEL "expr1"
#lexclass B
#token LABEL "expr2"
In this case, LABEL is the same token type number (#define in C or enum in C++) for both
expr1 and expr2. A reference to LABEL within a rule can be matched by two different reg-
ular expressions depending on which automaton is currently active.
Hence, the #lexclass directive marks the start of a new set of lexical definitions. Rules
found after a #lexclass can only use tokens defined within that class--i.e., all tokens
defined until the next #lexclass or the end of the PCCTS description, whichever comes
first. Any regular expressions used explicitly in these rules are placed into the current lexical
class. Since the default automaton, START, is active upon parser startup, the start rule must be
defined within the boundaries of the START automaton. Typically, a multiple-automaton gram-
mar will begin with
#lexclass START
immediately before the rule definitions to ensure that the rules use the token definitions in the
“main” automaton.
Tokens are given sequential token numbers across all lexical classes so that no conflicts arise.
This also allows the user to reference zztokens[token_num] (which is a string represent-
ing the label or regular expression defined in the grammar) regardless of which class
token_num is defined in.
2.2.6.1 Multiple grammars, multiple lexical analyzers
Different grammars will generally require separate lexical analyzers to break up the input
stream into tokens. What may be a keyword in one language may be a simple variable in
another. The #lexclass directive is used to group tokens into different lexical analyzers.
For example, to separate two grammars into two lexical classes,
#lexclass GRAMMAR1
rules for grammar1
#lexclass GRAMMAR2
rules for grammar2
All tokens found beyond the #lexclass directive will be considered of that class.
2.2.6.2 Single grammar, multiple lexical analyzers
For most languages, some characters are interpreted differently depending on the syntactic
context; comments and character strings are the most common examples. Consider the recog-
nition of C style comments.
#lexclass C_COMMENT
#token "[\n\r]" <<skip(); newline();>>
#token "\*/" <<mode(START); skip();>>
#token "\*~[/]" <<skip();>>
#token "~[\*\n\r]+" <<skip();>>
#lexclass START
#token "/\*" <<mode(C_COMMENT); skip();>>
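The effect of the two lexical classes above can be imitated by a small hand-written scanner. The sketch below is not DLG output; ScanResult and scan() are hypothetical names, and for brevity only '\n' is counted as a line terminator. Each mode recognizes its own patterns, and matching "/*" or "*/" switches the active mode, which is what the mode() calls do in the DLG actions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum LexMode { START, C_COMMENT };

struct ScanResult {
    std::string kept;   // characters seen while in START mode
    int line;           // line number after scanning (starts at 1)
    LexMode finalMode;  // C_COMMENT here means an unterminated comment
};

// Hand-written stand-in for the two automata: the current mode decides
// which patterns are tried on the input.
inline ScanResult scan(const std::string &in) {
    ScanResult r{"", 1, START};
    std::size_t i = 0;
    while (i < in.size()) {
        const bool two = i + 1 < in.size();
        if (r.finalMode == START) {
            if (two && in[i] == '/' && in[i + 1] == '*') {
                r.finalMode = C_COMMENT;        // like mode(C_COMMENT); skip();
                i += 2;
            } else {
                if (in[i] == '\n') ++r.line;
                r.kept += in[i++];
            }
        } else {                                // C_COMMENT mode
            if (two && in[i] == '*' && in[i + 1] == '/') {
                r.finalMode = START;            // like mode(START); skip();
                i += 2;
            } else {
                if (in[i] == '\n') ++r.line;    // like skip(); newline();
                ++i;                            // like skip();
            }
        }
    }
    assert(i == in.size());  // every character was consumed exactly once
    return r;
}
```

Scanning "p/*q*/r" keeps "pr"; a result whose finalMode is still C_COMMENT signals that the input ended inside a comment.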
2.2.7 Lexical Actions
It is sometimes convenient or necessary to have a section of user C code placed in the lexical
analyzer constructed automatically by DLG; for example, you may need to provide extern
definitions for variables or functions defined in the parser, but used in token actions. Normally,
actions not associated with a #token directive or embedded within a rule are placed in the
parser generated by ANTLR. However, preceding an action that appears outside of any rule
with the #lexaction pseudo-op directs the action to the lexical analyzer file. For example,
<< /* a normal action outside of the rules */ >>
#lexaction
<< /* this action is inserted into the lexical
* analyzer created by DLG
*/
>>
All #lexaction actions are collected and placed as a group into the C or C++ file where the
lexer resides. Typically, this code consists of functions or variable declarations needed by
#token actions.
2.2.8 Error Classes
The default syntax error reporting facility generates a list of tokens that could possibly be
matched when the erroneous token was encountered. Often, this list is large and means little to
the user for anything but small grammars. For example, an expression recognizer might gener-
ate the following error message for an invalid expression, “a b”:
syntax error at "b" missing { "\+" "\-" "\*" "/" ";" }
A better error message would be
syntax error at "b" missing { operator ";" }
This modification can be accomplished by defining the error class:
#errclass "operator" { "\+" "\-" "\*" "/" }
The general syntax for the #errclass directive is as follows:
#errclass label { T1 ... Tn }
where label is either a quoted string or a label (capitalized just like token labels). The
quoted string must not conflict with any rule name, token identifier or regular expression.
Groups of expressions will be replaced with this string. The error class elements, Ti, can be
labeled tokens, regular expressions or rules. Tokens (identifiers or regular expressions)
referenced within an error class must at some point in the grammar be referenced in a rule or
explicitly defined with #token. The definition need not appear before the #errclass definition.
If a rule is referenced, the FIRST set (set of all tokens that can be recognized first upon entering
a rule) for that rule is included in the error class. The -ge command-line option can be used to
have ANTLR generate an error class for each rule of the form:
#errclass Rule { rule }
where the error class name is the same as the rule except that the first character is converted to
uppercase.
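The substitution performed by an error class can be sketched in C++. The source does not say exactly how ANTLR decides when to substitute, so the policy below (replace the members with the label only when every member is expected) is an assumption, and ErrorClass, containsAll, and expectedList are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct ErrorClass {
    std::string label;                 // e.g. operator
    std::vector<std::string> members;  // e.g. "\+" "\-" "\*" "/"
};

// True if every element of subset also occurs in set.
inline bool containsAll(const std::vector<std::string> &set,
                        const std::vector<std::string> &subset) {
    for (const std::string &m : subset)
        if (std::find(set.begin(), set.end(), m) == set.end()) return false;
    return true;
}

// Build the "{ ... }" part of a "missing { ... }" message, substituting the
// class label for its members when all of them are expected (assumed policy).
inline std::string expectedList(std::vector<std::string> expected,
                                const ErrorClass &ec) {
    assert(!ec.members.empty());
    if (containsAll(expected, ec.members)) {
        for (const std::string &m : ec.members)
            expected.erase(std::remove(expected.begin(), expected.end(), m),
                           expected.end());
        expected.insert(expected.begin(), ec.label);
    }
    std::string out = "{";
    for (const std::string &e : expected) out += " " + e;
    return out + " }";
}
```

Given the five expected tokens from the example message and an error class labeled operator, the function yields { operator ";" }; if only some operators are expected, the list is left unchanged.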
2.3 Actions
Actions are embedded within your grammar to effect a translation. Without actions, ANTLR
grammars result in a simple recognizer, which answers yes or no as to whether an input sen-
tence was valid. This section describes where actions may occur within an ANTLR grammar,
when they are executed, and what special terms they may reference (e.g., for attributes).
Actions are of the form “<<...>>” (normal action) or “[...]” (argument or return value
block).
2.3.1 Placement
There are three main positions where actions may occur:
• Outside of any rule. These actions may not contain executable code unless it occurs
within a completely-specified function. Typically, these actions contain variable and
function declarations as would normally be found in a C or C++ program. These actions
will be placed in the global scope in the resulting parser. Consequently, all other actions
have access to the declarations given in these global actions. For example,
<<
extern int from_elsewhere;
enum T { X, Y, Z };
main()
{
...
}
>>
a : <<T b=X; printf("starting a");>>
blah
;
• Within a rule or immediately following the rule. These actions are executed during
the recognition of the input and must be executable code unless they are init-actions, in
which case, they may contain variable declarations as well. Actions immediately fol-
lowing the ‘;’ of a rule definition are fail-actions and are used to clean up after a syntax
error (these are less useful now due to parser exception handlers). For example,
rule : <<init-action>>
... <<normal action>> ...
;
<<fail-action>>
• As a rule argument or return value block. These actions either define arguments and
return values or they specify the value of arguments and return values; their behavior is
identical to that of normal C/C++ functions except that ANTLR allows you to define
more than one return value. For example,
code_block[Scope s] > [SymbolList localsyms]
: <<Symbol *sym;>>
"begin" decl[$s] > [sym] <<$localsyms.add(sym);>> "end"
;
2.3.2 Time of Execution
Actions placed among the elements in the productions of a rule are executed immediately
following the recognition of the preceding grammar element, whether that element is a simple
token reference or a large subrule.
Init-actions are executed before anything has been recognized in a subrule or rule. Init-actions
of subrules are executed regardless of whether or not anything is matched by the subrule. Fur-
ther, init-actions are always executed during guess mode; i.e., while evaluating a syntactic
predicate.
Fail-actions are used only when parser exception handlers are not used and are executed upon a
syntax error within that rule.
2.3.3 Interpretation of Action Text
ANTLR generally ignores what you place inside actions with the exception that certain expres-
sion terms are available to allow easy access to attributes (C interface), token pointers (C++
interface), and trees. The following tables describe the various special symbols recognized by
ANTLR inside [...] and <<...>> actions for the C and C++ interface.
Comments (both C and C++), characters, and strings are ignored by ANTLR. To escape ‘$’
and ‘#’, use ‘\$’ and ‘\#’.
Table 6 on page 39 provides a brief description of the available AST expressions. See Table 7
on page 55 for a more complete description.
TABLE 5. C++ Interface Interpretation of Terms in Actions
Symbol              Meaning
$j                  The token pointer for the jth element (which must be a token
                    reference) of the current alternative. The counting includes
                    actions. Subrules embedded within the alternative are counted
                    as one element. There is no token pointer associated with
                    subrules, actions, or rule references.
$i.j                The token pointer for the jth element of ith level starting
                    from the outermost (rule) level at 1.
$0                  Invalid. No translation. There is no token pointer associated
                    with rules.
$$                  Invalid. No translation.
$arg                The rule argument labeled arg.
$rule               Invalid. No translation.
$rv                 The rule return result labeled rv. (l-value)
$[token_type,text]  Invalid. There are no attributes using the C++ interface.
$[]                 Invalid.
TABLE 6. Synopsis of C/C++ Interface Interpretation of AST Terms in Actions
Symbol              Meaning
#0                  A pointer to the result tree of the enclosing rule. (l-value)
#i                  A pointer to the AST built by (or returned from) the ith
                    element of the enclosing alternative.
#label              A pointer to the AST built by (or returned from) the element
                    labeled with label. Translated to label_ast.
#[args]             Tree node constructor. Translated to a call to
                    zzmk_ast(zzastnew(), args) in C. In C++, it is translated to
                    “new AST(args)”.
#[]                 Empty tree node constructor.
2.3.4 Init-Actions
Init-actions are used to define local variables and, optionally, to execute some initialization
code for a rule or subrule. The init-action of a rule is executed exactly once--before anything in
the rule has been executed. It is not executed unless the rule is actually invoked by another rule
or a user action (such as a main routine). For example,
a : <<int i;>>
a:INT <<i = atoi(a->getText());>>
| ID <<i = 0;>>
;
The init-action of a subrule is always executed regardless of whether the subrule matches any
input. For example,
a : ( <<int i=3;>> ID )*
/* i is local to the (...)* loop and initialized only once */
{ <<f = 0;>> b:FLOAT <<f=atof(b->getText());>> }
/* f is 0 if a FLOAT was not found */
;
Init-actions should not reference attribute or token pointer symbols such as $label.
2.3.5 Fail Actions
Fail actions are actions that are placed immediately following the ‘;’ rule terminator. They are
executed after a syntax error has been detected but before a message is printed and the
attributes have been destroyed (optionally with zzd_attr()). However, attributes are not
valid here because one does not know at what point the error occurred and which attributes
even exist. Fail actions are often useful for cleaning up data structures or freeing memory. For
example,
a : <<List *p=NULL;>>
( Var <<append(p, $1);>> )+
<<OperateOn(p); rmlist(p);>>
;
<<rmlist(p);>>
The ( )+ loop matches a list of variables (Vars) and collects them in a list. The fail-action
<<rmlist(p);>> specifies that if and when a syntax error occurs, the elements are to
be free()ed.
Fail-actions should not reference attribute or token pointer symbols such as $label.
TABLE 6 (continued). Synopsis of C/C++ Interface Interpretation of AST Terms in Actions
Symbol                        Meaning
#(root, child1, ..., childn)  Tree constructor.
#()                           NULL.
2.3.6 Accessing Token Objects From Grammar Actions
The C++ interface parsing-model specifies that the parser accepts a stream of token pointers
rather than a stream of simple token types such as is done using the C interface parsing-model.
Rather than accessing attributes computed from the text and token type of the input token, the
C++ interface allows direct access to the stream of token objects created by the scanner. You
may reference $label within the actions of a rule where label is a label attached to a token
element defined within the same alternative. For example,
def : "var" id:ID ";" <<behavior->defineVar($id->getText());>>
In this case, $id is a pointer to the token object created by the scanner (via the
makeToken() function) for the token immediately following the keyword var on the input stream.
Normally, you will subclass ANTLRTokenBase or simply use ANTLRCommonToken as the
token object class; therefore, functions getText() and getLine() can be used to access
the “attributes” of the token object.
LT(i)???
LA, etc...
2.4 C++ Interface
[The ANTLR C++ interface was designed cooperatively with Sumana Srinivasan, Mike Monegan,
and Steve Naroff at NeXT, Inc.]
When generating recursive-descent parsers in C++, ANTLR uses the flexibility of C++ classes
in two ways to create modular, reusable code. First, ANTLR will generate parser classes in
which the class member functions, rather than global functions, contain the code (i) to recog-
nize rules and (ii) to perform semantic actions. Second, ANTLR uses snap-together classes for
the input, the lexer, and the token buffer. Figure 1 on page 41 shows the files generated by
ANTLR and DLG for grammar class Parser and grammar file file.g.
FIGURE 1 Files Generated By ANTLR, DLG
An ANTLR parser consists of one or more C++ classes, called parser classes. Each parser class
recognizes and translates part (or all) of a language. The recursive-descent recognition routines
and the semantic actions are member functions of this class. A parser object is an instantiation
(or variable) of the parser class.
[Figure 1: ANTLR produces Parser.C, Parser.h, file.C, parser.dlg, and tokens.h; DLG produces
DLGLexer.C and DLGLexer.h.]
To specify the name of the parser class in an ANTLR grammar description, enclose the appro-
priate rules and actions in a C++ class definition, as follows.
class Expr {
<<int i;>>
<<
public:
void print();
>>
e : INT ("\*" INT)* ;
... // other grammar rules
}
ANTLR would generate a parser class Expr that looks like the following. The types
ANTLRTokenBuffer and TokenType are discussed below.
class Expr : public ANTLRParser {
public:
Expr(ANTLRTokenBuffer *input);
Expr(ANTLRTokenBuffer *input, TokenType eof);
int i;
void print();
void e();
private:
internal-Expr-specific-data;
};
2.4.1 The Utility of C++ Classes in Parsing
It is natural to have many separate parser objects. For example, if parsing ANSI C code, we
might have three parser classes for C expressions, C declarations, and C statements. Parsing
multiple languages or parts of languages simply involves switching parser objects. For
example, if you had a working C language front-end for a compiler, to evaluate C expressions in a
debugger, just use the parser object for C expressions (and modify the semantic actions via vir-
tual functions as described below).
Using parser classes has the standard advantages of C++ classes involving name spaces and
encapsulation of state. Because all routines are class member functions, they belong in the
class namespace and do not clutter the global name space, reducing (or greatly simplifying) the
problem of name clashes. Lastly, a parser object encapsulates the various state needed during a
parse or translation.
While the ability to cleanly instantiate and invoke multiple parsers is useful, the main advan-
tage of parser classes is that they can be extended in an object-oriented fashion. By using the
inheritance and virtual functions mechanisms of C++, a parser class can be used as the base
class (superclass) for a variety of similar but non-identical uses. Derived parser classes would
be specialized for different activities; in many cases, these derived classes need only redefine
translation actions because they inherit the grammar rules from the base class (the
recursive-descent routines are member functions). For example,
class CPP_Parser {
<<
virtual void defineClass(char *cl);
>>
cdef
: "class" id:ID "\{" ... "\}" <<defineClass(id->getText());>>
;
...
}
To construct a browser, you might subclass CPP_Parser to override defineClass() so
that the function would highlight the class name on the screen; e.g.,
class CPP_Browser : public CPP_Parser {
public:
CPP_Browser(ANTLRTokenBuffer *in) : CPP_Parser(in) { }
void defineClass(char *cl) { highlight(cl); }
};
A C++ compiler might override defineClass() to add the symbol to the symbol table.
Alternatively, the behavior of a parser can be delegated to a behavior object such that actions in
the parser would be of the form
<<behavior->triggerSomeAction();>>
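The inheritance approach can be tried without ANTLR at all. In the sketch below, CppParserSketch stands in for a generated parser class (the real base class would be ANTLRParser and cdef() would be generated from the grammar rule); all class names and the std::string parameter type are assumptions for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for an ANTLR-generated parser class.
class CppParserSketch {
public:
    virtual ~CppParserSketch() {}
    // Pretend the rule cdef just recognized: "class" ID "{" ... "}"
    // and now fires its semantic action.
    void cdef(const std::string &className) { defineClass(className); }
protected:
    virtual void defineClass(const std::string &) {}  // default: do nothing
};

// A browser reacts to a class definition by highlighting the name.
class BrowserSketch : public CppParserSketch {
public:
    std::vector<std::string> highlighted;
protected:
    void defineClass(const std::string &cl) override { highlighted.push_back(cl); }
};

// A compiler front end records the same event in its symbol table.
class CompilerSketch : public CppParserSketch {
public:
    std::vector<std::string> symbols;
protected:
    void defineClass(const std::string &cl) override {
        assert(!cl.empty());  // a class must have a name
        symbols.push_back(cl);
    }
};
```

Both derived classes reuse the same recognition routine cdef(); only the action differs, which is the point of making the action a virtual function.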
2.4.2 Invoking ANTLR Parsers
The second way ANTLR uses C++ classes is to have separate C++ classes for the input stream,
the lexical analyzer (scanner), the token buffer, and the parser. Conceptually, these classes fit
together as shown in Figure 2 on page 43, and in fact, the ANTLR-generated classes “snap
together” in an identical fashion. To initialize the parser, you simply
1. attach an input stream object to a DLG-based scanner (if you have constructed your
own scanner, you would attach it here),
2. attach the scanner to a token buffer object, and
3. attach the token buffer to a parser object generated by ANTLR.
FIGURE 2 Overview of the C++ classes generated by ANTLR.
[Figure 2 components: DLGInputStream, DLGLexer, ANTLRTokenBuffer, ANTLRParser, output.]
The following code illustrates, for a parser object Expr, how these classes fit together.
main()
{
DLGFileInput in(stdin); // get an input stream for DLG
DLGLexer scan(&in); // connect a scanner to an input stream
ANTLRTokenBuffer pipe(&scan); // connect scanner, parser via pipe
ANTLRToken aToken;
scan.setToken(&aToken); // DLG needs vtbl to access virtual func
Expr parser(&pipe); // make a parser connected to the pipe
parser.init(); // initialize the parser
parser.e(); // begin parsing; e = start symbol
}
where ANTLRToken is programmer-defined and must be a subclass of ANTLRAbstractToken.
To start parsing, it is sufficient to call the Expr member function associated with the
grammar rule; here, e is the start symbol. Naturally, this explicit sequence is a pain to type so
we have provided a “black box” template:
main()
{
ParserBlackBox<DLGLexer, Expr, ANTLRToken> p(stdin);
p.parser()->e();
}
To ensure compatibility among different input streams, lexers, token buffers, and parsers, all
objects are derived from one of the four common base classes DLGInputStream,
DLGLexer, ANTLRTokenBuffer or ANTLRParser. All parsers are derived from a
common base class ANTLRParser.
2.4.3 ANTLR C++ Class Hierarchy
This section describes the important class relationships defined by the C++ interface. Figure 3
on page 51 highlights the relationship graphically while the following subsections provide
details for each class.
2.4.3.1 Token Classes
Each token object passed to the parser must satisfy at least the interface defined by class
ANTLRTokenBase if ANTLR is to do error reporting for you. Specifically, ANTLR token objects
know their token type, line number, and associated input text.
class ANTLRTokenBase {
public:
virtual int getLine();
virtual void setLine(int line);
virtual ANTLRChar *getText();
virtual void setText(ANTLRChar *);
virtual ANTLRLightweightToken *
makeToken(TokenType tt, ANTLRChar *txt, int line);
};
However, if you wish to do error reporting on your own (by redefining
ANTLRParser::syn()), then the minimal token object interface is (note the lack of a virtual
table; the object is the size of a TokenType variable member).
class ANTLRLightweightToken {
public:
TokenType getType();
void setType(TokenType t);
};
For those of you who really need a bizarre token (e.g., if you cannot get away with using a
TokenType member variable as your token object) you can subclass ANTLRAbstractToken.
To use this class, functions ANTLRParser::LA() and ANTLRParser::LT()
must be redefined.
The common case is that you will use the ANTLRTokenBase interface. For your convenience,
we have provided a token class, ANTLRCommonToken, that will work “out of the box.” It has
a fixed text field that is used to store the text of the token found on the input stream.
Some readers may wonder why function makeToken() is required at all and why you have
to pass the address of an ANTLRToken into the DLG-based scanner during parser initializa-
tion. Why cannot the constructor be used to create a token and so on? The reason lies with the
scanner, which must construct the token objects. The DLG support routines are typically in a
precompiled object file that is linked in regardless of your token definition. Hence, DLG must
be able to create tokens of any type.
Because objects in C++ are not “self-conscious” (i.e., they do not know their own type), DLG
has no idea what the appropriate constructor is. Constructors cannot be virtual anyway; so, we
had to come up with a constructor that is virtual and that acts like a factory--it returns the
address of a new token object upon each invocation rather than just initializing an existing
object.
Because classes are not first-class objects in C++ (i.e., you cannot pass class names around),
we must pass DLG the address of an ANTLRToken token object so DLG has access to the
appropriate virtual table and is, thus, able to call the appropriate makeToken(). This weird-
ness would disappear if all objects knew their type or if class names were first-class objects.
Here is the code fragment in DLG that constructs the token objects that are passed to the parser
via the ANTLRTokenBuffer:
ANTLRAbstractToken *DLGLexerBase::
getToken()
{
if ( token_to_fill==NULL ) panic("NULL token_to_fill");
TokenType tt = nextTokenType();
DLGBasedToken *tk = (DLGBasedToken *)
token_to_fill->makeToken(tt, _lextext, _line);
return tk;
}
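The "virtual constructor" idea can be isolated in a small self-contained sketch. The classes below are stand-ins, not the PCCTS classes: AbstractTokenSketch, MyToken, ScannerSketch, and describe() are hypothetical, and std::unique_ptr is used here only to keep the example leak-free. The scanner is written against the abstract interface alone, yet it produces tokens of the user's class because it calls makeToken() through a prototype object.

```cpp
#include <cassert>
#include <memory>
#include <string>

class AbstractTokenSketch {
public:
    virtual ~AbstractTokenSketch() {}
    // Factory: returns a NEW token of the same dynamic type as *this.
    virtual std::unique_ptr<AbstractTokenSketch>
    makeToken(int type, const std::string &text, int line) const = 0;
    virtual std::string describe() const = 0;
};

// A user-defined token class, unknown to the scanner at compile time.
class MyToken : public AbstractTokenSketch {
public:
    MyToken() : type_(0), line_(0) {}
    MyToken(int type, const std::string &text, int line)
        : type_(type), text_(text), line_(line) {}
    std::unique_ptr<AbstractTokenSketch>
    makeToken(int type, const std::string &text, int line) const override {
        return std::unique_ptr<AbstractTokenSketch>(new MyToken(type, text, line));
    }
    std::string describe() const override {
        return "MyToken(" + std::to_string(type_) + "," + text_ + "," +
               std::to_string(line_) + ")";
    }
private:
    int type_;
    std::string text_;
    int line_;
};

// Plays the role of the precompiled scanner support code.
class ScannerSketch {
public:
    ScannerSketch() : prototype_(nullptr) {}
    void setToken(const AbstractTokenSketch *prototype) { prototype_ = prototype; }
    std::unique_ptr<AbstractTokenSketch>
    getToken(int type, const std::string &text, int line) const {
        assert(prototype_ != nullptr);  // mirrors the panic on a NULL token_to_fill
        return prototype_->makeToken(type, text, line);
    }
private:
    const AbstractTokenSketch *prototype_;
};
```

Passing the address of one MyToken to setToken() is what gives the scanner access to the right virtual table, just as scan.setToken(&aToken) does in the initialization sequence shown earlier.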
2.4.3.2 Scanners and Token Streams
The raw stream of tokens coming from a scanner is accessed via an ANTLRTokenStream.
The required interface is simply that the token stream must be able to answer the message
getToken():
class ANTLRTokenStream {
public:
virtual ANTLRAbstractToken *getToken() = 0;
};
To use your own scanner, subclass ANTLRTokenStream and define getToken() or have
getToken() call the appropriate function in your scanner. For example,
class MyLexer : public ANTLRTokenStream {
private:
char c;
public:
MyLexer();
virtual ANTLRAbstractToken *getToken();
};
DLG scanners are all (indirect) subclasses of ANTLRTokenStream.
2.4.3.3 Token Buffer
The parser is “attached” to an ANTLRTokenBuffer via interface functions: getToken()
and bufferedToken(). The object that actually consumes characters and constructs
tokens, a subclass of ANTLRTokenStream is connected to the ANTLRTokenBuffer via
interface function ANTLRTokenStream::getToken(). This strategy isolates the infinite
lookahead mechanism (used for syntactic predicates) from the parser and provides a “sliding
window” into the token stream.
The ANTLRTokenBuffer begins with k token object pointers where k is the size of the
lookahead specified on the ANTLR command line. The buffer is circular when the parser is not
evaluating a syntactic predicate (that is, when ANTLR is not guessing during the parse); when a
new token is consumed the least recently read token pointer is discarded. When the end of the
token buffer is reached during a syntactic predicate evaluation, however, the buffer grows so
that the token stream can be rewound to the point at which the predicate was initiated. The
buffer can only grow--never shrink.
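The sliding-window behavior can be sketched with a small stand-in class. TokenBufferSketch, mark(), rewind(), and buffered() are hypothetical names, tokens are plain ints with 0 standing for end-of-input, and nested guessing is not modeled. The sketch also tracks only the live window with a std::deque, so it does not reproduce the "never shrinks" storage policy of the real buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

class TokenBufferSketch {
public:
    TokenBufferSketch(const std::vector<int> &input, std::size_t k)
        : input_(input), next_(0), k_(k), pos_(0), guessing_(false) {
        assert(k >= 1);
        fill(k_);  // start with k tokens of lookahead
    }
    // The i-th token of lookahead (1-based); 0 stands for end-of-input.
    int LT(std::size_t i) {
        assert(i >= 1);
        fill(pos_ + i);
        return window_[pos_ + i - 1];
    }
    void consume() {
        fill(pos_ + 1);
        if (guessing_) {
            ++pos_;               // keep the token so the stream can be rewound
        } else {
            window_.pop_front();  // discard the least recently read token
            fill(k_);
        }
    }
    // Begin evaluating a syntactic predicate (start guessing).
    void mark() { assert(!guessing_); guessing_ = true; }
    // End of predicate: return to the point at which it was initiated.
    void rewind() { assert(guessing_); pos_ = 0; guessing_ = false; }
    std::size_t buffered() const { return window_.size(); }
private:
    void fill(std::size_t n) {
        while (window_.size() < n)
            window_.push_back(next_ < input_.size() ? input_[next_++] : 0);
    }
    std::vector<int> input_;
    std::size_t next_;
    std::size_t k_;
    std::size_t pos_;
    bool guessing_;
    std::deque<int> window_;
};
```

Outside a predicate the window stays at k tokens; between mark() and rewind() it grows by one token per consume() so that every token seen while guessing is still available afterwards.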
By default, the token buffer does not delete any tokens as they are discarded--it is up to the
user to delete them or not at an appropriate time. If the user calls
ANTLRParser::deleteTokens(), then token objects are deleted as their pointers are discarded.
This may delete a token object that you would still like to access, and it can happen at
inconvenient times. The bigger the buffer (see ANTLRTokenBuffer::setMinTokens()), the less
frequently this will occur; even so, copy the information from token objects into your own data
structures, such as your symbol table, as soon as possible. [This is, of course, a lousy way to
handle things and we hope to implement reference counting for token objects in the near future.]
The token object pointers in the token buffer may be accessed from your actions with
ANTLRParser::LT(i), where i=1..n and n is the number of token objects remaining in the
file; LT(1) is a pointer to the next token to be recognized. This function can be used to write
sophisticated semantic predicates that look deep into the rest of the input token stream to make
complicated decisions. For example, the C++ qualified item construct is difficult to match
because there may be an arbitrarily large sequence of scopes before the object can be identified
(e.g., A::B::~B()).
The ANTLRParser::LA(i) function returns the token type of the ith lookahead symbol,
but is valid only for i=1..k. This function uses a cache of k tokens stored in the parser itself--
the token buffer itself is not queried.
2.4.3.4 Parsers
ANTLR generates a subclass of ANTLRParser called P for definitions in your grammar file of
the form:
class P {
...
}
The functions of interest that you may wish to invoke or override are:
class ANTLRParser {
public:
virtual void init();
Note: you must call ANTLRParser::init() if you override init().
TokenType LA(int i);
The token type of the ith symbol of lookahead where i=1..k.
ANTLRTokenBase *LT(int i);
The token object pointer of the ith symbol of lookahead where i=1..n (n is the number
of tokens remaining in the input).
void setEofToken(TokenType t);
When using non-DLG-based scanners, you must inform the parser what token type
should be considered end-of-input. This token type will then be used by the error
recovery facilities to scan past bogus tokens without going beyond the end of the
input.
void deleteTokens();
Any token pointer discarded from the token buffer is deleted if this function is
called.
virtual void syn(ANTLRTokenBase *tok, ANTLRChar *egroup,
SetWordType *eset, TokenType etok, int k);
You can redefine syn() to change how ANTLR reports error messages; see
edecode() below.
void panic(char *msg);
Call this if something really bad happens--the parser will terminate.
virtual void consume();
Get another token of input.
void consumeUntil(SetWordType *st); // for exceptions
This function forces the parser to consume tokens until a token in the token class
specified (or end-of-input) is found. That token is not consumed--you may want to
call consume() afterwards.
void consumeUntilToken(int t);
Consume tokens until the specified token is found (or end of input). That token is not
consumed--you may want to consume() afterwards.
protected:
void edecode(SetWordType *);
Print out in set notation the specified token class. Given a token class called T in your
grammar, the set name will be called T_set in an action.
virtual void tracein(char *r);
This function is called upon entry into rule r.
virtual void traceout(char *r);
This function is called upon exit from rule r.
};
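The resynchronization idea behind consumeUntilToken() can be sketched on its own. ResyncSketch below is a hypothetical stand-in that reads token types from a vector; it is not the ANTLRParser implementation, and LA1() and position() are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class ResyncSketch {
public:
    explicit ResyncSketch(const std::vector<int> &tokens, int eofType = 0)
        : tokens_(tokens), pos_(0), eof_(eofType) {}
    // Token type of the next symbol; the end-of-input type once exhausted.
    int LA1() const {
        assert(pos_ <= tokens_.size());
        return pos_ < tokens_.size() ? tokens_[pos_] : eof_;
    }
    // Get another token of input (no effect at end of input).
    void consume() { if (pos_ < tokens_.size()) ++pos_; }
    // Skip tokens until t (or end of input) is next; t itself is NOT consumed.
    void consumeUntilToken(int t) {
        while (LA1() != t && LA1() != eof_) consume();
    }
    std::size_t position() const { return pos_; }
private:
    std::vector<int> tokens_;
    std::size_t pos_;
    int eof_;
};
```

After consumeUntilToken(t) the parser is positioned on t, so an error handler that wants to continue after, say, a statement terminator would follow it with one more consume().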
2.4.3.5 AST Classes
ANTLR’s AST definitions are subclasses of ASTBase, which is derived from PCCTS_AST (so
that the SORCERER and ANTLR trees have a common base).
The interesting functions are as follows:
class PCCTS_AST {
// minimal SORCERER interface
virtual PCCTS_AST *right();
Return next sibling.
virtual PCCTS_AST *down();
Return first child.
virtual void setRight(PCCTS_AST *t);
Set the next sibling.
virtual void setDown(PCCTS_AST *t);
Set the first child.
virtual int type();
What is the node type (used by SORCERER).
virtual void setType(int t);
Set the node type (used by SORCERER).
virtual PCCTS_AST *shallowCopy();
Return a copy of the node (used for SORCERER in transform mode).
// not needed by ANTLR--support functions; see SORCERER doc
virtual PCCTS_AST *deepCopy();
virtual void addChild(PCCTS_AST *t);
virtual void insert_after(PCCTS_AST *a, PCCTS_AST *b);
virtual void append(PCCTS_AST *a, PCCTS_AST *b);
virtual PCCTS_AST *tail(PCCTS_AST *a);
virtual PCCTS_AST *bottom(PCCTS_AST *a);
virtual PCCTS_AST *cut_between(PCCTS_AST *a, PCCTS_AST *b);
virtual void tfree(PCCTS_AST *t);
virtual int nsiblings(PCCTS_AST *t);
virtual PCCTS_AST *sibling_index(PCCTS_AST *t, int i);
virtual void panic(char *err);
Print an error message and terminate the program.
};
ASTBase is a subclass of PCCTS_AST and adds the functionality:
class ASTBase : public PCCTS_AST {
public:
ASTBase *dup();
Return a duplicate of the tree.
void destroy();
Delete the entire tree.
static ASTBase *tmake(ASTBase *, ...);
Construct a tree from a possibly NULL root (first argument) and a list of children--fol-
lowed by a NULL argument.
void preorder();
Do a preorder traversal of a tree (normally used to print out a tree in LISP form).
virtual void preorder_action();
What to do at each node during the traversal.
virtual void preorder_before_action();
What to do before descending a down pointer link (i.e., before visiting the children
list). Prints a left parenthesis by default.
virtual void preorder_after_action();
What to do upon return from visiting a children list. Prints a right parenthesis by
default.
};
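The three preorder hooks can be pictured with a small stand-alone sketch. The Node type and the function names below are hypothetical and are not the PCCTS classes; the sketch assumes only a child-sibling tree and collects the LISP-like form in a string instead of printing it.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical child-sibling node; it stands in for an ASTBase subclass.
struct Node {
    std::string text;
    Node *down = nullptr;   // first child
    Node *right = nullptr;  // next sibling
    explicit Node(std::string s) : text(std::move(s)) {}
};

// Walk t and its siblings in preorder, appending a LISP-like form to out.
// The three commented steps play the roles of the three hooks.
void preorderToString(const Node *t, std::string &out) {
    for (; t != nullptr; t = t->right) {
        if (t->down != nullptr) out += "( ";  // before descending a down link
        out += t->text + " ";                 // action at each node
        preorderToString(t->down, out);       // visit the children list
        if (t->down != nullptr) out += ") ";  // after returning from the children
    }
}

// Build the tree for "3+4" by hand and return its printed form.
std::string demoPreorder() {
    Node plus("+"), three("3"), four("4");
    plus.down = &three;
    three.right = &four;
    std::string out;
    preorderToString(&plus, out);
    assert(!out.empty());
    return out;
}
```

In this sketch only the per-node step would need to change to print richer node contents, which matches the remark that redefining preorder_action() alone is usually sufficient.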
To use doubly-linked child-sibling trees, subclass ASTDoublyLinkedBase instead:
class ASTDoublyLinkedBase : public ASTBase {
public:
void double_link(ASTBase *left, ASTBase *up);
Set the parent (up) and previous child (left) pointers of the whole tree. Initially, left
and up arguments to this function must be NULL.
PCCTS_AST *left() { return _left; }
Return the previous child.
PCCTS_AST *up() { return _up; }
Return the parent (works for any sibling in a sibling list).
};
Note, however, that the tree routines from ASTBase do not update the left and up pointers.
You must call double_link() to update all the links in the tree.
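What such a relinking pass has to do can be sketched as follows. This is not the PCCTS implementation; DNode and doubleLink are hypothetical names, and the sketch assumes only that every node has down, right, left, and up pointers.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical doubly-linked child-sibling node.
struct DNode {
    std::string text;
    DNode *down = nullptr;   // first child
    DNode *right = nullptr;  // next sibling
    DNode *left = nullptr;   // previous sibling
    DNode *up = nullptr;     // parent
    explicit DNode(std::string s) : text(std::move(s)) {}
};

// Set the left and up pointers for t, its siblings, and all descendants.
// As with double_link(), the initial call passes NULL for both arguments.
void doubleLink(DNode *t, DNode *left, DNode *up) {
    for (; t != nullptr; left = t, t = t->right) {
        t->left = left;
        t->up = up;
        doubleLink(t->down, nullptr, t);  // the first child has no left sibling
    }
}

// Returns true if the links of a hand-built "+ 3 4" tree come out as expected.
bool demoDoubleLink() {
    DNode plus("+"), three("3"), four("4");
    plus.down = &three;
    three.right = &four;
    doubleLink(&plus, nullptr, nullptr);
    return plus.up == nullptr && three.up == &plus && four.up == &plus &&
           three.left == nullptr && four.left == &three;
}
```

Because every sibling receives the same up pointer, the parent can be reached from any node in a sibling list, as stated for up() above.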
FIGURE 3 C++ Class Hierarchy
[Class diagram; the classes shown are: ANTLRAbstractToken, ANTLRLightweightToken,
ANTLRTokenBase, DLGBasedToken, ANTLRCommonToken, MyToken; ANTLRTokenStream,
DLGLexerBase, DLGLexer; PCCTS_AST, ASTBase, ASTDoublyLinkedBase; ANTLRParser, MyParser.]
2.5 Intermediate-Form Tree Construction
You may embed actions within an ANTLR grammar to construct abstract syntax trees (ASTs),
but ANTLR provides an automatic mechanism for implicitly or explicitly specifying tree
structures. Using the automatic mechanism, you must define what an AST node looks like and how
to construct an AST node given an attribute (C) or token pointer (C++); the ANTLR -gt
command line option must be turned on so that ANTLR knows to generate the extra code in the
resulting parser to construct and manipulate trees. In this section, we describe the required C or
C++ definitions, the available support functions, the AST operators, and the special symbols
used in actions to construct nodes and trees.
2.5.1 Required AST Definitions
The C++ interface requires that you derive a class called AST from ASTBase. The derived
class contains the fields you need for your own purposes and a constructor that accepts an
ANTLRToken pointer; the constructor fills in the AST node from the contents of the token
object. Here is an example AST node definition that merely copies the entire token object for
which the node was created:
class AST : public ASTBase {
public:
ANTLRToken token;
AST(ANTLRToken *t) { token = *t; }
};
The calling of rules from C++ code is slightly different when trees are being built. As with the
C interface, the address of a NULL-initialized tree pointer must be passed to the starting rule;
the pointer will come back set to the tree constructed for that rule:
int main()
{
    ParserBlackBox<...> p(stdin);
    ASTBase *root = NULL;                     // must be NULL before the call
    p.parser->start_rule((ASTBase **)&root);  // root comes back set to the tree
    return 0;
}
2.5.2 AST Support Functions
The following group of PCCTS_AST members are needed by SORCERER, but are available
to ANTLR programmers.
PCCTS_AST *right();
Return the next sibling to the right of this.
PCCTS_AST *down();
Return the first child of this.
void setRight(PCCTS_AST *t);
Set the next sibling to the right to t.
void setDown(PCCTS_AST *t);
Set the first child to t.
int token();
Return the token type of this.
void setToken(int t);
Set the token type to t.
The following PCCTS_AST members are not needed by ANTLR or SORCERER, but are sup-
port functions available to both; they are mainly useful to SORCERER applications.
void addChild(PCCTS_AST *t);
Add a child t of this.
PCCTS_AST *find_all(PCCTS_AST *t,
PCCTS_AST *u,
PCCTS_AST **cursor);
Find all occurrences of u in t. The cursor must be initialized to NULL before calling this
function and is used to keep track of where in t the function left off between function
calls. Returns NULL when no more occurrences are found. It is useful for iterating over
every occurrence of a particular subtree.
int match(PCCTS_AST *t, PCCTS_AST *u);
Returns true if t and u have the same structure--that is, the trees have the same tree
structure and token type fields. If both trees are NULL, true is returned.
void insert_after(PCCTS_AST *a, PCCTS_AST *b);
Add b immediately after a as its sibling. If b is a sibling list at its top level, then the last
sibling of b points to the previous right sibling of a. Inserting a NULL pointer has no
effect.
void append(PCCTS_AST *a, PCCTS_AST *b);
Add b to the end of a’s sibling list. Appending a NULL pointer is illegal.
PCCTS_AST *tail(PCCTS_AST *a);
Find the end of a’s sibling list and return a pointer to it.
PCCTS_AST *bottom(PCCTS_AST *a);
Find the bottom of a’s child list (going straight “down”).
PCCTS_AST *cut_between(PCCTS_AST *a, PCCTS_AST *b);
Unlink all siblings between a and b and return a pointer to the first element of the sibling
list that was unlinked. Basically, all this routine does is to make a point to b and make
sure that the tail of the sibling list, which was unlinked, does not point to b. The routine
ensures that a and b are (perhaps indirectly) connected to start with.
void tfree(PCCTS_AST *t);
Recursively walk t, deleting all the nodes in a depth-first order.
int nsiblings(PCCTS_AST *t);
Returns the number of siblings of t.
PCCTS_AST *sibling_index(PCCTS_AST *t, int i);
Return a pointer to the ith sibling where the sibling to the right of t is the first. An index
of 0 returns t.
The following ASTBase members are specific to ANTLR:
static ASTBase *tmake(ASTBase *, ...);
See the #(...) in “Interpretation of C/C++ Actions Related to ASTs” on page 55.
ASTBase *dup();
Duplicate the current tree.
void preorder();
Perform a preorder walk of the tree using the following member functions.
void preorder_action();
What to do in each node as you do a preorder walk. Typically, preorder() is used to
print out a tree in lisp-like notation. In that case, it is sufficient to redefine this function
alone.
void preorder_before_action();
What to print out before walking down a tree.
void preorder_after_action();
What to print out after walking down a tree.
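The behavior described for the tree constructor (a possibly NULL root, then children up to the first NULL) can be sketched in stand-alone C++. This is an illustration under assumptions, not the tmake() source: Node, makeTree, and countSiblings are hypothetical names, and an initializer list replaces the variable-length argument list.

```cpp
#include <cassert>
#include <initializer_list>
#include <string>
#include <utility>

// Hypothetical child-sibling node; it stands in for an ASTBase subclass.
struct Node {
    std::string text;
    Node *down = nullptr;   // first child
    Node *right = nullptr;  // next sibling
    explicit Node(std::string s) : text(std::move(s)) {}
};

// Link the children into one sibling list, stopping at the first NULL.
// With a NULL root the sibling list itself is returned.
Node *makeTree(Node *root, std::initializer_list<Node *> children) {
    Node *first = nullptr;
    Node *last = nullptr;
    for (Node *c : children) {
        if (c == nullptr) break;  // a NULL ends the list of children
        if (first == nullptr) first = c; else last->right = c;
        last = c;
        while (last->right != nullptr) last = last->right;  // c may be a list
    }
    if (root == nullptr) return first;
    assert(root->down == nullptr);  // the sketch assumes a fresh root node
    root->down = first;
    return root;
}

// Count the nodes in the sibling list that starts at t (t included).
int countSiblings(const Node *t) {
    int n = 0;
    for (; t != nullptr; t = t->right) ++n;
    return n;
}
```

Stopping at the first NULL is what lets an unmatched optional element, whose tree pointer is still NULL, drop out of the result without special handling.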
2.5.3 Operators
ANTLR grammars may be annotated with AST construction operators. The operators are suffi-
cient to describe all but the strangest tree structures. Table 10 on page 28 summarizes the oper-
ators.
Consider the “root” operator ‘^’. The token modified by the root operator is considered the
root of the currently-built AST. As a result, the rule
add : INT PLUS^ INT ;
results in an AST of the form:
  PLUS
  /  \
INT  INT
The way to “read” rule add with regards to AST building is to say
“Make a node for INT and add it to the sibling list (it is a parent-less only child now).
Make a node for PLUS and make it the root of the current tree (which makes it simply
the parent of the only child). Make a node for the second INT and add it to the sibling
list.”
Think of the AST operators as actions that get executed as they are encountered, not as
something that specifies a tree structure known at ANTLR analysis-time. For example, what if a
looping subrule is placed in the rule:
add : INT (PLUS^ INT)* ;
Input “3+4+5” would yield:
    +
   / \
  +   5
 / \
3   4
After the “3+4” has been read, the current tree for rule add would be:
  +
 / \
3   4
just as before. However, because the (...)* allows us to match another addition term, two
more nodes will be added to the tree for add, one of which is a root node. After the recognition
of the second ‘+’ in the input, the tree for add would look like:
  +
  |
  +
 / \
3   4
because the ‘+’ was made the root of the current tree due to the root operator. The ‘5’ is a
simple leaf node (since it was not modified with either ‘^’ or ‘!’) and is added to the current
sibling list. The addition of the new root changes the current sibling list from the 3 and 4 list to the
first ‘+’ that was added to the tree; i.e., the first child of the new root. Hence, the ‘5’ is added to
the second level in the tree and we arrive at the final tree structure.
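Reading the AST operators as run-time actions can be made concrete with a small simulation of rule add : INT (PLUS^ INT)* on input “3+4+5”. The sketch below is not the code ANTLR generates; Node, RuleTree, addChild, and makeRoot are hypothetical names chosen to mirror the two kinds of step described above (add to the current sibling list, or become the root of the current tree).

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical child-sibling node.
struct Node {
    std::string text;
    Node *down = nullptr;   // first child
    Node *right = nullptr;  // next sibling
    explicit Node(std::string s) : text(std::move(s)) {}
};

// State of the tree being built for one rule invocation.
struct RuleTree {
    Node *root = nullptr;     // top of the tree built so far
    Node *sibling = nullptr;  // first node of the current sibling list
    Node *tail = nullptr;     // last node of the current sibling list

    // Plain token reference: append n to the current sibling list.
    void addChild(Node *n) {
        if (sibling != nullptr) {
            tail->right = n;
        } else {
            sibling = n;
            if (root != nullptr) root->down = n;
        }
        tail = n;
        if (root == nullptr) root = sibling;
    }

    // Token reference followed by '^': n becomes the root of the current tree.
    void makeRoot(Node *n) {
        if (root != nullptr && root->down == sibling) {
            sibling = tail = root;  // the old root starts the new sibling list
        }
        root = n;
        root->down = sibling;
    }
};

// Append a LISP-like rendering of t and its siblings to out.
void toLisp(const Node *t, std::string &out) {
    for (; t != nullptr; t = t->right) {
        if (t->down != nullptr) out += "( ";
        out += t->text + " ";
        toLisp(t->down, out);
        if (t->down != nullptr) out += ") ";
    }
}

// Replay the operators of add : INT (PLUS^ INT)* on "3+4+5".
std::string demoAdd() {
    Node n3("3"), p1("+"), n4("4"), p2("+"), n5("5");
    RuleTree add;
    add.addChild(&n3);
    add.makeRoot(&p1);
    add.addChild(&n4);
    add.makeRoot(&p2);
    add.addChild(&n5);
    assert(add.root == &p2);
    std::string out;
    toLisp(add.root, out);
    return out;
}
```

The final step shows the point made in the text: after the second ‘+’ becomes the root, the current sibling list starts at the first ‘+’, so the ‘5’ lands on the second level.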
Subrules merely modify the tree being built for the current rule whereas each rule has its own
tree. For example, if the (...)* in add were moved to a different rule:
add: INT addTerms
;
addTerms
: (PLUS^ INT)*
;
then the following, very different, tree would be generated for input “3+4+5”:
3    +
    / \
   +   5
   |
   4
While this tree structure is not totally useless, it is not as useful as the previous structure
because the ‘+’ operators are not at the subtree roots.
2.5.4 Interpretation of C/C++ Actions Related to ASTs
Actions within ANTLR grammar rules may reference expression terms that are not valid C or
C++ expressions, but are understood and translated by ANTLR. These terms are useful, for
example, when you must construct trees too complicated for simple grammar annotation, when
nodes must be added to the trees built by ANTLR, or when partial trees must be examined.
Table 7 describes these terms.
TABLE 7. C/C++ Interface Interpretation of AST Terms in Actions
Symbol                         Meaning
#0                             A pointer to the result tree of the enclosing rule (l-value).
#i                             A pointer to the AST built by (or returned from) the ith
                               element of the enclosing alternative.
#label                         A pointer to the AST built by (or returned from) the element
                               labeled with label. Translated to label_ast.
#[args]                        Tree node constructor. Translated to a call to
                               zzmk_ast(zzastnew(), args) in C, where zzastnew() allocates
                               and returns a pointer to a new, initialized AST node. You must
                               define zzmk_ast() if this term is used. In C++, it is
                               translated to “new AST(args)”.
#[]                            Empty tree node constructor. Translated to a call to
                               zzastnew() in C and to “new AST” in C++.
#(root, child1, ..., childn)   Tree constructor. If root is NULL, then a sibling list is
                               returned. The childi arguments are added to the sibling list
                               until the first NULL pointer is encountered (not counting the
                               root pointer).
#()                            NULL.
Consider how one might build an AST for an if-statement. A useful tree may quickly and easily
be constructed via grammar annotations:
if : IF^ expr THEN! stat { ELSE! stat } ;
Here, the IF is identified as the root of the tree, the THEN and ELSE are omitted as unnecessary,
and the trees constructed by expr and stat are linked in as children of the IF node:
       IF
     /  |  \
expr  stat  stat
To construct the same tree structure manually, the following grammar is sufficient:
if!: IF e:expr THEN st:stat { ELSE el:stat }
<<#if = #(#[IF], #e, #st, #el);>>
;
where the ‘!’ in the rule header indicates that no automatic tree construction should be done by
ANTLR. The “#[IF]” term constructs a tree node via “new AST(IF)” (assuming you have
defined an AST constructor taking a TokenType argument) and the “#(...)” tree constructor
puts the IF node above the children matched for the conditional and statements. The el
label is initialized to NULL and, hence, contributes nothing to the resulting tree if an
else-clause is not found on the input.
2.6 Predicates
Predicates are used to recognize difficult language constructs such as those that are
context-sensitive or those that require unbounded lookahead to recognize. This section provides a brief
description of how predicates are defined and used to alter the normal LL(k) parsing strategy.
2.6.1 Semantic Predicates
Semantic predicates alter the parse based upon run-time information. Generally, this information
is obtained from a symbol table and is used to recognize context-sensitive constructs such
as those that are syntactically identical, but semantically very different; e.g., variables and type
names are simple identifiers, but are used in completely different contexts.
The basic semantic predicate takes the form of an action suffixed with the ‘?’ operator:
“<<...>>?”. No white space is allowed between the “>>” and the ‘?’. Predicates must be
boolean expressions and may not have side effects (i.e., they should not modify variables);
alternatives without predicates are assumed to have a predicate of “<<1>>?”. Also, because
predicates can be hoisted out of rules, as we will see shortly, predicates that refer to rule
parameters or local variables are often undesirable.
2.6.1.1 Validating Semantic Predicates
All semantic predicates behave at least as validating predicates. That is, all predicates must
evaluate to true as the parser encounters them during the parse or a semantic error occurs. For
example,
typename
: <<isTypeName(LT(1)->getText())>>? ID
;
When typename is invoked, the predicate is tested before attempting to match the ID token
reference; where isTypeName() is some user-defined boolean function. If the first symbol
of lookahead is not a valid type name, ANTLR will generate an error message indicating that
the predicate failed.
A fail action may be specified by appending a “[...]” action; this action is executed upon
failure of the predicate when acting as a validating predicate:
typename
: <<isTypeName(LT(1)->getText())>>?[dialogBox(BadType)] ID
;
where we can presume that dialogBox(BadType) is a user-defined function that opens a
dialog box to display an error message. ANTLR generates code similar to the following:
void Parser::typename(void)
{
if (!(isTypeName(LT(1)->getText()))) dialogBox(BadType) ;
zzmatch(ID);
consume();
return;
}
using the C++ interface.
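The control flow of a validating predicate with a fail action can be tried out in a stand-alone sketch. Nothing below is ANTLR-generated code or the PCCTS API: MiniParser, Token, the token type names, and the set-based isTypeName() are assumptions made for illustration, and the fail action simply records a message instead of opening a dialog box.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

enum TokenType { ID, INT, END_OF_INPUT };  // hypothetical token types

struct Token {
    TokenType type;
    std::string text;
};

class MiniParser {
public:
    MiniParser(std::vector<Token> tokens, std::set<std::string> typeNames)
        : tokens_(std::move(tokens)), typeNames_(std::move(typeNames)) {
        assert(!tokens_.empty());  // the sketch expects at least an end marker
    }

    // typename : <<isTypeName(text of LT(1))>>?[record a message] ID ;
    bool typeName() {
        if (!isTypeName(LT(1).text)) {
            failures.push_back("not a type name: " + LT(1).text);  // fail action
        }
        if (LT(1).type != ID) return false;  // the token match itself
        ++pos_;                              // consume
        return true;
    }

    std::vector<std::string> failures;  // messages left by the fail action

private:
    // i-th token of lookahead, clamped to the last token.
    const Token &LT(std::size_t i) const {
        const std::size_t k = pos_ + i - 1;
        return tokens_[k < tokens_.size() ? k : tokens_.size() - 1];
    }
    bool isTypeName(const std::string &s) const { return typeNames_.count(s) > 0; }

    std::vector<Token> tokens_;
    std::set<std::string> typeNames_;
    std::size_t pos_ = 0;
};
```

As in the generated code shown above, the fail action runs and the parse then carries on with the token match; the predicate failure is reported but does not by itself stop the rule.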
[We anticipate a future release in which a semantic error throws a parser exception rather than
simply executing the fail action.]
2.6.1.2 Disambiguating Semantic Predicates
When ANTLR finds a syntactic ambiguity in your grammar, ANTLR attempts to resolve the
ambiguity with semantic information. In other words, ANTLR searches the grammar for any
predicates that provide semantic information concerning the tokens in the lookahead buffer. A
predicate that is tested during the parse to make a parsing decision (as opposed to merely
checking for validity once a decision has been made) is considered a disambiguating predicate.
We say that disambiguating predicates are hoisted into a parsing decision from a rule or rules.
A predicate that may be hoisted into a decision is said to be visible to that decision. In this sec-
tion, we describe which predicates are visible, how multiple predicates are combined, and how
visible predicates are incorporated into parsing decisions. See Chapter ### for a complete
description of how predicates may be used.
ANTLR searches for semantic predicates when a syntactically ambiguous parsing decision is
discovered. The set of visible predicates is collected and tested in the appropriate prediction
expression. We say that predicate p is visible to an alternative (and, hence, may be used to pre-
dict that alternative) if p can be evaluated correctly without consuming another input token and
without executing a user action. Generally, visible predicates reside on the left edge of produc-
tions; predicates not on the left edge usually function as validating predicates only. For exam-
ple,
a : <<p1>>? ID
| b
;
b : <<p2>>? ID
| <<action>> <<p3>>? ID
| INT <<p4>>? ID
| { FLOAT } <<p5>>? ID
;
First we observe that a lookahead of ID predicts both alternatives of rule a. ANTLR will
search for predicates that potentially disambiguate the parsing decision. Here, we see that p1
may be used to predict alternative one because it can be evaluated without being hoisted over a
grammar element or user action. Alternative two of rule a has no predicate, but the alternative
references rule b which has two predicates. Predicate p2 is visible, but p3 is not because p3
would have to be executed before action, which would change the action execution order.
Predicate p4 is not visible because an INT token would have to be consumed before p4 could
be evaluated in the correct context. Predicate p5 is not visible because a FLOAT token may
have to be consumed to gain the correct context. Rule a could be coded something like the fol-
lowing:
a()
{
if ( LA(1)==ID && (p1) ) {
MATCH(ID);
consume();
}
else if ( LA(1)==ID && (p2) ) {
b();
}
}
Predicates may be hoisted over init-actions because init-actions are assumed to contain merely
local variable allocations. E.g.,
a : <<init-action>> // does not affect hoisting
<<p1>>? ID
| b
;
Care must be taken so that predicates do not refer to local variables or rule parameters if the
predicate could be hoisted out of that rule. I.e.,
a : b | ID
;
b[int ctx]
: <<ctx>>? ID
;
Predicate ctx will be hoisted into rule a, resulting in a C or C++ compilation error because
ctx exists only in rule b.
Alternatives without predicates are assumed to be semantically valid; hence, predicates on
some alternatives are redundant. For example,
a : <<flag>>? ID
| <<!flag>>? ID
;
The predicate on the second alternative is unnecessary: if flag evaluates to false, the first
alternative is rejected and the second is chosen anyway, so testing !flag is redundant.
A predicate used to help predict an alternative may or may not apply to all lookahead
sequences predicting that alternative. We say that the lookahead must be consistent with the
context of the predicate for it to provide useful information. Consider the following example.
a : (var | INT)
| ID
;
var: <<isVar(LATEXT(1))>>? ID
;
Because ID predicts both alternatives of rule a, ANTLR will hoist predicate isVar() into
the prediction expression for alternative one. However, both INT and ID predict alternative
one--evaluating isVar() when the lookahead is INT would be incorrect as it would return
false when in fact no semantic information is known about INTs. Alternative one of rule a
would never be able to match an INT.
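The difference between testing a hoisted predicate blindly and testing it only in its own context can be sketched as two prediction functions for alternative one of rule a. The function names and the set-based isVar() below are assumptions made for illustration; they are not what ANTLR emits.

```cpp
#include <cassert>
#include <set>
#include <string>

enum TokenType { ID, INT };  // hypothetical token types

// Stand-in for a symbol table lookup.
bool isVar(const std::string &text, const std::set<std::string> &vars) {
    return vars.count(text) > 0;
}

// Naive hoisting: the predicate is tested whatever the lookahead token is,
// so an INT can never predict alternative one.
bool predictAltOneNaive(TokenType la1, const std::string &text,
                        const std::set<std::string> &vars) {
    return (la1 == ID || la1 == INT) && isVar(text, vars);
}

// Context-aware hoisting: the predicate is consulted only when the
// lookahead is ID, the context in which the predicate was written.
bool predictAltOneWithContext(TokenType la1, const std::string &text,
                              const std::set<std::string> &vars) {
    return (la1 == ID || la1 == INT) && (la1 != ID || isVar(text, vars));
}
```

With the context test in place, an INT predicts alternative one without isVar() being consulted, while an ID still has to pass the predicate.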
When hoisting a predicate, ANTLR computes and then carries along the context under which
the predicate was found (with “-prc on” command-line option). The required depth, k, for
the predicate context is determined by examining the actual predicate to see what lookahead
depths are used; predicates that do not reference LT(k) or LATEXT(k) are assumed to have
k=1. Normally, k=1 as predicates usually test only the next lookahead symbol. Table 8 on
page 59 provides a few examples.
TABLE 8. Sample Predicates and Their Lookahead Contexts
Predicate                           Context
a : <<p(LT(1))>>? ID ;              ID
a : <<p(LT(1))>>? b ;
b : ID | INT ;                      ID, INT
a : <<p(LT(2))>>? LPAREN ID ;       LPAREN ID
The third predicate in the table provides context information for the ID following the
LPAREN; hence, the context is LPAREN followed by ID. The other two examples require a
lookahead depth of k=1.
Predicates normally apply to exactly one lookahead sequence. ANTLR will give you a warning
for any predicate that applies to more than one sequence.
There are situations when you would not wish ANTLR to compute the lookahead context of
predicates: (i) when ANTLR would take too long to compute the context and (ii) when the
predicate applies only to a subset of the full context computed by ANTLR. In these situations,
a predicate context-guard is required, which allows you to specify the syntactic context under
which the predicate is valid. The form of a context-guarded predicate is
( context-guard )? => <<semantic-predicate>>?
where the context-guard can be any grammar fragment that specifies a set of k-sequences
where k is the depth referenced in semantic-predicate. For example,
cast_expr
: ( LPAREN ID )? => <<isTypeName(LT(2))>>?
LPAREN abstract_type RPAREN
;
This predicate dictates that when the lookahead is LPAREN followed by ID, then the
isTypeName() predicate provides semantic information and may be evaluated. Without the
guard, ANTLR would assume that the lookahead was LPAREN followed by all tokens that
could begin an abstract_type.
Multiple lookahead k-sequences can also be specified inside the context guard:
a : ( ID | KEYWORD )? => <<predicate>>? b ;
EBNF looping constructs such as (...)* are not valid in context-guards. [We
anticipate a future version that would allow you to specify arbitrary lookahead sequences to
recognize language constructs like the C++ scope override; e.g., A::B::C::foo].
Because there may be more than one predicate visible to an alternative, ANTLR has rules for
combining multiple predicates.
1. Predicates or groups of predicates taken from alternative productions are ||’d
together.
2. Predicates or groups of predicates taken from the same production are &&’d together.
For example,
decl
: typename declarator ";"
| declarator ";"
;
declarator
: ID
;
typename
: classname
| enumname
;
classname
: <<isClass(LATEXT(1))>>? ID
;
enumname
: <<isEnum(LATEXT(1))>>? ID
;
The decision for the first alternative of rule decl would hoist both predicates and test them in
a decision similar to the following:
if ( LA(1)==ID && (isClass(LATEXT(1))||isEnum(LATEXT(1))) ) { ...
Adding a predicate to rule decl:
decl
: <<flag>>? typename declarator ";"
| declarator ";"
;
would result in flag being &&’d with the result of the combination of the other two predicates:
if (LA(1)==ID && (flag&&(isClass(LATEXT(1))||isEnum(LATEXT(1))))) { ...
In reality, ANTLR inserts code to ensure that the predicates are tested only when the parser
lookahead is consistent with the context associated with each predicate; here, all predicates
have ID as their context and the redundant tests have been removed for clarity.
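The two combination rules can be checked with a stand-alone sketch of the decision for the first alternative of decl. The symbol table type and the function names are hypothetical; only the shape of the boolean expression is taken from the text.

```cpp
#include <cassert>
#include <map>
#include <string>

enum class Kind { Class, Enum, Variable };        // hypothetical symbol kinds
using SymbolTable = std::map<std::string, Kind>;  // hypothetical symbol table

bool hasKind(const SymbolTable &syms, const std::string &name, Kind k) {
    const auto it = syms.find(name);
    return it != syms.end() && it->second == k;
}

bool isClass(const SymbolTable &syms, const std::string &name) {
    return hasKind(syms, name, Kind::Class);
}

bool isEnum(const SymbolTable &syms, const std::string &name) {
    return hasKind(syms, name, Kind::Enum);
}

// Prediction for "<<flag>>? typename declarator" when the lookahead is text.
bool predictTypenameAlt(bool la1IsID, const std::string &text,
                        const SymbolTable &syms, bool flag) {
    // Predicates from alternative productions of typename are ||'d together.
    const bool fromTypename = isClass(syms, text) || isEnum(syms, text);
    // Predicates met along the same production are &&'d together.
    return la1IsID && (flag && fromTypename);
}
```

Passing flag as true reproduces the decision for the version of decl without the extra predicate.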
2.6.1.3 Semantic Predicates Effect Upon Syntactic Predicates
During the evaluation of a syntactic predicate, semantic predicates that have been hoisted into
prediction expressions are still evaluated. Success or failure of these disambiguating predicates
simply alters the parse and does not directly cause syntax errors.
Validation predicates (those that have not been hoisted) are also still evaluated. However, a
failed validation predicate aborts the current syntactic predicate being evaluated whereas, nor-
mally, a failure causes a syntax error.
2.6.2 Syntactic Predicates
Just as semantic predicates indicate when a production is valid, syntactic predicates also
indicate when a production is a candidate for recognition. The difference lies in the type of
information used to predict alternative productions. Semantic predicates employ information about
the meaning of the input (e.g., symbol table information) whereas syntactic predicates employ
structural information like normal LL(k) parsing decisions. Syntactic predicates specify a
grammatical construct that must be seen on the input stream for a production to be valid.
Moreover, this construct may match input streams that are arbitrarily long; normal LL(k)
parsers are restricted to using the next k symbols of lookahead. This section describes the form and
use of syntactic predicates.
2.6.2.1 Syntactic Predicate Form and Meaning
Syntactic predicates have the form
( α )? β
or, the shorthand form
( α )?
which is identical to
( α )? α
where α and β are arbitrary Extended BNF (EBNF) grammar fragments that do not define new
nonterminals. The meaning of the long form syntactic predicate is:
“If α is matched on the input stream, attempt to recognize β.”
Note the similarity to the semantic predicate
<<α>>? β
which means
“If α evaluates to true at parser run-time, attempt to match β.”
Syntactic predicates may only occur at the extreme left edge of alternatives because they are
only useful during the prediction of alternatives--not during the subsequent recognition of the
alternatives.
Alternative productions that are syntactically unambiguous, but non-LL(k), should be rewrit-
ten, left-factored, or modified to use syntactic predicates. Consider the following rule
type
: ID
| ID
;
The alternatives are syntactically ambiguous because they can both match the same input. The
rule is a candidate for semantic predicates, not syntactic. The following example is unambigu-
ous--it is just not deterministic to a normal LL(k) parser.
Let's now consider a small chunk of the vast C++ declaration syntax. Can you tell exactly what
type of object f is after having seen the left parenthesis?
int f(
The answer is “no.” Object f could be an integer initialized to some previously defined
symbol a:
int f(a);
or a function prototype or definition:
int f(float a) {...}
The following is a greatly simplified grammar for these two declaration types:
decl: type ID "\(" expr_list "\)" ";"
| type ID "\(" arg_decl_list "\)" func_def
;
One notices that left-factoring “type ID "\("” would be trivial because our grammar is so
small and the left-prefixes are identical. However, if a user action were required before
recognition of the reference to rule type, left-factoring would not be possible:
decl
: <<// dummy init action; next action isn't init action>>
<<printf("var init\n");>>
type ID "\(" expr_list "\)" ";"
| <<printf("func def\n");>>
type ID "\(" arg_decl_list "\)" func_def
;
The solution to the problem involves looking arbitrarily ahead (type could be arbitrarily big,
in general) to determine what appears after the left-parenthesis. This problem is easily solved
implicitly by using a syntactic predicate:
decl
: ( <<//dummy action>>
<<printf("var init\n");>>
type ID "\(" expr_list "\)" ";"
)?
| <<printf("func def\n");>>
type ID "\(" arg_decl_list "\)" func_def
;
The (...)? indicates that it is impossible to decide, from the left edge of rule decl with a
finite amount of lookahead, which production to predict. Any grammar construct inside a
(...)? block is attempted and, if it fails, the next alternative production that could match the
input is attempted. This represents selective backtracking and is similar to allowing ANTLR
parsers to guess without being “penalized” for being wrong. Note that the first action of any
block is considered an init-action. Because an init-action may define local variables, it cannot
be gated out with an if-statement: any local variables would not be visible outside the
if-statement. This is why a dummy action precedes the printf action in the first alternative.
2.6.2.2 Modified Parsing Strategy
Decisions that are not augmented with syntactic predicates are parsed deterministically with
finite lookahead up to depth k as is normal for ANTLR-generated parsers. When at least one
syntactic predicate is present in a decision, rule recognition proceeds as follows:
1. Find the first viable production; i.e., the first production in the alternative list
predicted by the current finite lookahead, according to the associated finite-lookahead
prediction-expression.
2. If the first grammar element in that production is not a syntactic predicate, predict
that production and go to 4; else attempt to match the predicting grammar fragment
of the syntactic predicate.
3. If the predicting grammar fragment is matched, predict the associated production
and go to 4; else find the next viable production and go to 2.
4. Proceed with the normal recognition of the production predicted in 2 or 3.
For successful predicates, both the predicting grammar fragment and the remainder of the pro-
duction are actually matched, hence, the short form, (α)?, actually matches α twice--once to
predict and once to apply α normally, executing any embedded actions.
2.6.2.3 Nested Syntactic Predicate Invocation
Because syntactic predicates may reference any defined nonterminal and because of the recur-
sive nature of grammars, it is possible for the parser to return to a point in the grammar which
had already requested backtracking. This nested invocation poses no problem from a theoreti-
cal point of view, but can cause unexpected parsing delays in practice if you are not careful.
2.6.2.4 Efficiency
The order of alternative productions in a decision is significant. Productions in a PCCTS
grammar are always attempted in the order specified. For example, the parsing strategy outlined
above indicates that the following rule is most efficient when foo is less complex than bar.
a : (foo)?
| bar
;
because testing the simplest possibility first is faster.
Any parsing decisions made inside a (...)? block are made deterministically unless they
themselves are prefixed with syntactic predicates. For example,
a : ( (A)+ X | (B)+ X )?
| (A)* Y
;
specifies that the parser should attempt to match the nonpredicated grammar fragment
( (A)+ X
| (B)+ X
)
using the normal finite-lookahead parsing strategy. If a phrase recognizable by this
grammar fragment is found on the input stream, the state of the parser is restored to what it was
before the predicate invocation and the grammar fragment is parsed again. Else, if the grammar
fragment failed to match the input, apply the next production in the outer block:
(A)* Y
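The guess, rewind, and reparse cycle can be sketched with a tiny hand-written recognizer. It is a simplification under stated assumptions: the (B)+ X alternative is dropped, tokens are single characters, the finite-lookahead test that normally precedes a guess is omitted, and the class and member names are hypothetical. User actions are skipped while guessing, in the manner the manual describes for actions during syntactic predicate evaluation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class GuessingParser {
public:
    explicit GuessingParser(std::string input) : input_(std::move(input)) {}

    // a : ( (A)+ X )? | (A)* Y ;
    bool a() {
        const std::size_t mark = pos_;  // remember where the guess started
        ++guessing_;                    // user actions are off while guessing
        const bool matched = aPlusX();
        --guessing_;
        pos_ = mark;                    // rewind whether or not the guess worked
        if (matched) return aPlusX();   // parse again, this time with actions
        return aStarY();                // otherwise apply the next production
    }

    std::vector<std::string> actionLog;  // which user actions really ran

private:
    bool match(char c) {
        if (pos_ < input_.size() && input_[pos_] == c) { ++pos_; return true; }
        return false;
    }
    void action(const std::string &what) {
        if (guessing_ == 0) actionLog.push_back(what);
    }
    bool aPlusX() {
        if (!match('A')) return false;
        while (match('A')) {}
        if (!match('X')) return false;
        action("matched (A)+ X");
        return true;
    }
    bool aStarY() {
        while (match('A')) {}
        if (!match('Y')) return false;
        action("matched (A)* Y");
        return true;
    }

    std::string input_;
    std::size_t pos_ = 0;
    int guessing_ = 0;  // nesting depth of guesses in progress
};
```

On input AAX the fragment is matched twice, once to predict and once for real, yet the action is logged only once because the first pass ran in guess mode.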
2.6.2.5 Syntactic Predicates Effect Upon Actions and Semantic Predicates
While evaluating a syntactic predicate, user actions, such as adding symbol table entries, are
not executed because in general, they cannot be “undone;” this conservative approach avoids
affecting the parser state in an irreversible manner. Upon successful evaluation of a syntactic
predicate, actions are once again enabled--unless the parser was in the process of evaluating
another syntactic predicate.
Because semantic predicates are restricted to side-effect-free expressions, they are always eval-
uated when encountered. However, during syntactic predicate evaluation, the semantic predi-
cates that are evaluated must be functions of values computed when actions were enabled. For
example, if your grammar has semantic predicates that examine the symbol table, all symbols
needed to direct the parse during syntactic predicate evaluation must be entered into the table
before this backtracking phase has begun.
Because init-actions are always executed, it is possible to trick ANTLR into actually executing
an action during the evaluation of a syntactic predicate by simply enclosing the action in a sub-
rule:
(<<action>>)
2.6.2.6 Syntactic Predicates Effect Upon Grammar Analysis
ANTLR constructs normal LL(k) decisions throughout predicated parsers, only resorting to
arbitrary lookahead predictors when necessary. Calculating the lookahead sets for a full LL(k)
parser can be quite expensive, so that, by default, ANTLR uses a linear approximation to the
lookahead and only uses full LL(k) analysis when required. When ANTLR encounters a
syntactic predicate, it generates the instructions for selective backtracking as you would expect,
but also generates an approximate decision. Although no finite lookahead decision is actually
required (the arbitrary lookahead mechanism will accurately predict the production without it)
the approximate portion of the decision reduces the number of times backtracking is attempted
without hope of a successful match. An unexpected, but important benefit of syntactic predi-
cates is that they provide a convenient method for preventing ANTLR from attempting full
LL(k) analysis when doing so would cause unacceptable analysis delays.
2.7 Parser Exception Handlers
Parser exception handlers provide a more sophisticated alternative to the automatic error
reporting and recovery facility provided by ANTLR. The notion of throwing and catching
parser error signals is similar to C++ exception handling; however, our implementation allows
both the C and C++ interface to use parser exception handling. This section provides a terse
description of the syntax and semantics of ANTLR exceptions.
When a parsing error occurs, the parser throws an exception. The most recently encountered
exception handler that catches the appropriate signal is executed. The parse continues after
the exception by prematurely returning from the rule that handled the exception. Generally, the
rule that catches the exception is not the rule that throws the exception; e.g., a statement
rule may be a better place to handle an error than the bowels of an expression evaluator as the
statement rule has unambiguous context information with which to generate a good error
message and recover.
Exception handlers may be specified:
1. After any alternative. These handlers apply only to signals thrown while recogniz-
ing the elements of that alternative.
2. After the ‘;’ of a rule definition. These handlers apply to any signal thrown while
recognizing any alternative of the rule unless the handler references an element label,
in which case the handler applies only to recognition of that rule element. Non-
labeled handlers attached to a rule catch signals not caught by handlers attached to
an alternative.
3. Before the list of rules. These global exception handlers apply when a signal is not
caught by a handler attached to a rule or alternative. Global handlers behave slightly
differently in that they are always executed in the rule that throws the signal; the rule
is still prematurely exited.
2.7.1 Exception Handler Syntax
The syntax for an exception group is as follows:
exception_group
: "exception" { ID } ( exception_handler )*
{ "default" ":" ACTION }
;
exception_handler
: "catch" SIGNAL ":" { ACTION }
;
where SIGNAL is one of the predefined signals summarized in Table 9.
A “default :” clause may also be used in your exception group to match any signal that
was thrown. [Currently, you cannot define your own exception signals.]
You can define multiple signals for a single handler. E.g.,
exception
catch MismatchedToken :
catch NoViableAlt :
catch NoSemViableAlt :
<<
printf("stat:caught predefined signal\n");
consumeUntil(DIE_set);
>>
TABLE 9. Summary of Predefined Parser Exception Signals
Signal Name       Description
NoViableAlt       Exception thrown when none of the alternatives in a pending rule or
                  subrule were predicted by the current lookahead.
NoSemViableAlt    Exception thrown when no alternatives were predicted in a rule or
                  subrule and at least one semantic predicate (for a syntactically
                  viable alternative) failed.
MismatchedToken   Exception thrown when the pending token to match did not match the
                  first symbol of lookahead.
If a label (attached to a rule reference) is specified for an exception group, that group may be
specified after the end of the ‘;’ rule terminator. Because element labels are unique per rule,
ANTLR can still uniquely identify which rule reference to associate the exception group with;
it often makes a rule cleaner to have most of the exception handlers at the end of the rule. For
example,
a : A t:expr B
| ...
;
exception[t]
catch ...
catch ...
The NoViableAlt signal only makes sense for labeled exception groups and for rule excep-
tion groups.
2.7.2 Exception Handler Order of Execution
Given a signal, S, the handler that is invoked is determined by looking through the list of
enabled handlers in a specific order. Loosely speaking, we say that a handler is enabled
(becomes active) and pushed onto an exception stack when it has been seen by the parser on its
way down the parse tree. A handler is disabled and taken off the exception stack when the
associated grammar fragment is successfully parsed. The formal rules for enabling are:
• All global handlers are enabled upon initial parser entry.
• Exception handlers specified after an alternative become enabled when that alternative
is predicted.
• Exception handlers specified for a rule become enabled when the rule is invoked.
• Exception handlers attached with a label to a particular rule reference within an alternative
are enabled just before the invocation of that rule reference.
Disabling rules are:
• All global handlers are disabled upon parser exit.
• Exception handlers specified after an alternative are disabled when that alternative has
been (successfully) parsed completely.
• Exception handlers specified for a rule become disabled just before the rule returns.
• Exception handlers tied to a particular rule reference within an alternative are disabled
just after the return from that rule reference.
Upon an error condition, the parser will throw an exception signal, S. Starting at the top of the
stack, each exception group is examined looking for a handler for S. The first S handler found
on the stack is executed. In practice, the run time stack and hardware program counter are used
to search for the appropriate handler. This amounts to the following:
1. If there is an exception specified for the enclosing alternative, then look for S in that
group first.
2. If there is no exception for that alternative or that group did not specify an S handler,
then look for S in the enclosing rule’s exception group.
3. Global handlers are like macros that are inserted into the rule exception group for
each rule.
4. If there is no rule exception or that group did not specify an S handler, then return
from the enclosing rule with the current error signal still set to S.
5. If there is an exception group attached (via label) to the rule that just returned, check
that exception group for S.
6. If an exception group attached to a rule reference does not have an S handler, then
look for S in the enclosing rule’s exception group.
This process continues until an S handler is found or a return instruction is executed in the starting
rule (in this case, the start symbol would have a return-parameter set to S).
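The conceptual search described above, from the top of the exception stack downward, can be sketched in C as follows. This is a model only: as noted, the real parser uses the run time stack and program counter rather than an explicit data structure, and the type names, group ids, and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NoSignal = 0, NoViableAlt, NoSemViableAlt, MismatchedToken } Signal;

#define MAX_CATCH 3
#define MAX_DEPTH 32

/* One exception group: the signals it catches plus an optional default. */
typedef struct {
    int    id;                  /* hypothetical group number, for reporting */
    Signal catches[MAX_CATCH];  /* unused slots hold NoSignal */
    int    has_default;         /* nonzero if a "default :" clause exists */
} ExceptionGroup;

typedef struct {
    const ExceptionGroup *group[MAX_DEPTH];
    int top;                    /* number of currently enabled groups */
} ExceptionStack;

/* A group is enabled (pushed) when the parser passes it on the way down. */
void enable_group(ExceptionStack *s, const ExceptionGroup *g)
{
    assert(s->top < MAX_DEPTH);
    s->group[s->top++] = g;
}

/* A group is disabled (popped) when its grammar fragment has been parsed. */
void disable_group(ExceptionStack *s)
{
    assert(s->top > 0);
    s->top--;
}

static int group_catches(const ExceptionGroup *g, Signal sig)
{
    int i;
    for (i = 0; i < MAX_CATCH; i++)
        if (g->catches[i] == sig) return 1;
    return g->has_default;
}

/* Return the id of the first group, searching from the top of the stack,
   that has a handler for sig; return -1 if no enabled group handles it. */
int find_handler(const ExceptionStack *s, Signal sig)
{
    int i;
    assert(sig != NoSignal);
    for (i = s->top - 1; i >= 0; i--)
        if (group_catches(s->group[i], sig)) return s->group[i]->id;
    return -1;
}
```

A result of -1 corresponds to the case in which the starting rule returns with the signal still set.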
These guidelines are best described with an example:
a : A c B
exception /* 1 */
catch MismatchedToken : <<ACTION1>>
| C t:d D
exception /* 2 */
catch MismatchedToken : <<ACTION2>>
catch NoViableAlt : <<ACTION3>>
;
exception[t] /* 3 */
catch NoViableAlt : <<ACTION4>>
exception /* 4 */
catch NoViableAlt : <<ACTION5>>
c : E ;
d : e ;
e : F | G
;
exception /* 5 */
catch MismatchedToken : <<ACTION6>>
The following table summarizes the sequence in which the exception groups are tested.
TABLE 10. Example Order of Search For Exception Handlers
Input    Exception group search sequence    Action Executed
D E B    4                                  5
A E D    1                                  1
A F B    1                                  1
C F B    2                                  2
C E D    5, 2                               3
The global handlers are like macro insertions. For example:
exception catch NoViableAlt : <<blah blah>>
a : A
;
exception
catch MismatchedToken : <<ack;>>
b : B
;
This grammar fragment is functionally equivalent to:
a : A
;
exception
catch MismatchedToken : <<ack;>>
catch NoViableAlt : <<blah blah>>
b : B
;
exception
catch NoViableAlt : <<blah blah>>
2.7.3 Modifications to Code Generation
The following items describe the changes to the output parser C or C++ code when at least one
exception handler is specified:
• Each rule reference acquires a signal parameter which returns 0 if no error occurred
during that reference or it returns a nonzero signal S.
• The MATCH() macro throws MismatchedToken rather than calling zzsyn()--the
standard error reporting and recovery function.
• When no viable alternative is found, NoViableAlt is signaled rather than calling the
zzsyn() routine.
• The parser no longer resynchronizes automatically. See Section 2.7.5, “Resynchronizing the
Parser”.
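The effect of these changes can be pictured with a hand-written C sketch in which every rule function takes a signal parameter. It is not the code ANTLR emits: the Parser structure, the match() helper, the token names, and the decision to clear the signal once a handler has run are all assumptions made for illustration.

```c
#include <assert.h>

enum { TOK_EOF = 0, TOK_A, TOK_B, TOK_E };   /* hypothetical token types */
enum { NoSignal = 0, NoViableAlt, NoSemViableAlt, MismatchedToken };

typedef struct {
    const int *tokens;   /* token types, terminated by TOK_EOF */
    int pos;             /* index of the current lookahead token */
    int handled;         /* number of times the handler action ran */
} Parser;

static int la1(const Parser *p) { return p->tokens[p->pos]; }

/* Stand-in for MATCH(): sets the signal instead of reporting an error. */
static void match(Parser *p, int type, int *signal)
{
    if (la1(p) != type) { *signal = MismatchedToken; return; }
    p->pos++;
}

/* c : E ;   (no handler, so any signal is returned to the caller) */
void rule_c(Parser *p, int *signal)
{
    *signal = NoSignal;
    match(p, TOK_E, signal);
}

/* a : A c B ;  exception catch MismatchedToken : <<handled++;>> */
void rule_a(Parser *p, int *signal)
{
    assert(p != 0 && signal != 0);
    *signal = NoSignal;
    if (la1(p) != TOK_A) {          /* no alternative predicted */
        *signal = NoViableAlt;      /* not caught in this rule */
        return;
    }
    match(p, TOK_A, signal);
    if (*signal == NoSignal) rule_c(p, signal);
    if (*signal == NoSignal) match(p, TOK_B, signal);
    if (*signal == MismatchedToken) {
        p->handled++;               /* the handler's action */
        *signal = NoSignal;         /* assumed: a caught signal is cleared */
    }
}
```

Note that rule_a stops matching as soon as a signal is set, whether it was raised by its own match() or returned by rule_c, which mirrors the premature return from the rule that handles the exception.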
2.7.4 Semantic Predicates and NoSemViableAlt
When the input stream does not predict any of the alternatives in the current list of possible
alternatives, NoViableAlt is thrown. However, what happens when semantic predicates are
specified in that alternative list? There are cases where it would be very misleading to just
throw NoViableAlt when in fact one or more alternatives were syntactically viable; i.e., the
reason that no alternative was predicted was due to a semantic invalidity--a different signal
must be thrown. For example,
expr : <<P1>>? ID ... /* function call */
| <<P2>>? ID ... /* array reference */
| INT
;
exception
catch NoViableAlt :
<<no ID or INT was found>>
catch NoSemViableAlt :
<<an ID was found, but it was not valid>>
Typically, you would want to give very different error messages for the two different situations.
Specifically, giving a message such as
syntax error at ID missing { ID INT }
would be very misleading (i.e., wrong).
The rule for distinguishing between NoViableAlt and NoSemViableAlt is:
If NoViableAlt would be thrown and at least one semantic predicate
(for a syntactically viable alternative) failed, signal NoSemViableAlt
instead of NoViableAlt.
[Semantic predicates that are not used to predict alternatives do not yet throw signals. You
must continue to use the fail-action attached to individual predicates in these cases.]
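The rule for choosing between the two signals can be written out as a small C function. This is a sketch under stated assumptions rather than ANTLR's implementation: the AltStatus structure and the function name are hypothetical, and each alternative is summarized by three flags computed elsewhere.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NoSignal = 0, NoViableAlt, NoSemViableAlt, MismatchedToken } Signal;

typedef struct {
    int syntactically_viable;  /* the lookahead predicts this alternative */
    int has_predicate;         /* a semantic predicate guards it */
    int predicate_ok;          /* result of evaluating that predicate */
} AltStatus;

/* Pick the first alternative that is viable both syntactically and
   semantically. If none exists, report NoSemViableAlt when at least one
   syntactically viable alternative was rejected by its predicate, and
   NoViableAlt otherwise. *chosen is set to the alternative index or -1. */
Signal choose_alternative(const AltStatus *alts, size_t n, int *chosen)
{
    size_t i;
    int predicate_failed = 0;
    assert(alts != NULL && chosen != NULL);
    *chosen = -1;
    for (i = 0; i < n; i++) {
        if (!alts[i].syntactically_viable) continue;
        if (!alts[i].has_predicate || alts[i].predicate_ok) {
            *chosen = (int)i;
            return NoSignal;
        }
        predicate_failed = 1;
    }
    return predicate_failed ? NoSemViableAlt : NoViableAlt;
}
```

For the expr rule above, an ID for which both P1 and P2 fail yields NoSemViableAlt, while a token that is neither ID nor INT yields NoViableAlt.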
2.7.5 Resynchronizing the Parser
When an error occurs while parsing rule R, the parser will generally not be able to continue
parsing anywhere within that rule--it will return immediately after executing any exception
code. The one exception is for handlers attached to a particular rule reference. In this case, the
parser knows exactly where in the alternative you would like to continue parsing from--specif-
ically, immediately after the rule reference.
After reporting an error, your handler must resynchronize the parser by consuming zero or
more tokens. More importantly, this consumption must be appropriate given the point where
the parser will attempt to continue parsing. For example, when an error occurs during the
recognition of the conditional of an if-statement, a good way to recover would be to consume
tokens until the then is found on the input stream.
stat : IF e:expr THEN stat
;
exception[e]
default : <<print error; consumeUntilToken(THEN);>>
The parser will continue with the parse after the expr reference (since we attached the excep-
tion handler to the rule reference) and look for the then right away.
To allow this type of manual resynchronization of the parser, two functions are provided:
TABLE 11. Resynchronization Functions
Function               Description
consumeUntil(X_set)    Consume tokens until a token in the token class X is seen. Recall
                       that ANTLR generates a packed bitset called X_set for each token
                       class X. The C interface prefixes the function name with “zz”.
consumeUntilToken(T)   Consume tokens until token T is seen. The C interface prefixes the
                       function name with “zz”.
For example,
#tokclass RESYNCH { A C }
a : b A
| b C
;
b : B
;
exception
catch MismatchedToken : // consume until FOLLOW(b)
<<print error message; zzconsumeUntil(RESYNCH_set);>>
You may also use function set_el(T, TC_set) (prefix with “zz” in C interface) to test
token type T for membership in a token class TC. For example,
<<if ( zzset_el(LA(1), TC_set) ) blah blah blah;>>
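To make the idea of a packed bitset and a consume-until loop concrete, here is a self-contained C sketch. It does not reproduce the PCCTS set layout or the real zzconsumeUntil(); the token names, the SetWord type, and the function names are hypothetical, and stopping at end of input is a safety assumption of this sketch.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical token types; TOK_EOF marks the end of the token array. */
enum { TOK_EOF = 0, TOK_A, TOK_B, TOK_C, TOK_X, NUM_TOKENS };

typedef unsigned char SetWord;
#define SET_WORDS ((NUM_TOKENS + CHAR_BIT - 1) / CHAR_BIT)

/* Add token type t to a packed bitset (one bit per token type). */
void set_add(SetWord *set, int t)
{
    assert(t >= 0 && t < NUM_TOKENS);
    set[t / CHAR_BIT] |= (SetWord)(1u << (t % CHAR_BIT));
}

/* Membership test, in the spirit of set_el(T, TC_set). */
int set_member(int t, const SetWord *set)
{
    assert(t >= 0 && t < NUM_TOKENS);
    return (set[t / CHAR_BIT] >> (t % CHAR_BIT)) & 1;
}

/* Skip tokens, starting at pos, until the current token is in set.
   Returns the new position; the token that ends the scan is not consumed.
   Stopping at TOK_EOF is an assumption made to keep the loop bounded. */
int consume_until(const int *tokens, int pos, const SetWord *set)
{
    while (tokens[pos] != TOK_EOF && !set_member(tokens[pos], set))
        pos++;
    return pos;
}
```

With a set built from A and C, as in the RESYNCH token class above, scanning the token sequence X X B C A stops at C, leaving C as the next token for the parser to match.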
2.7.6 The @ Operator
You may suffix any token reference with the '@' operator, which indicates that if that token is
not seen on the input stream, errors are to be handled immediately rather than throwing a
MismatchedToken exception. In particular, [for the moment] the macro zzmatch_wdfltsig()
or zzsetmatch_wdfltsig() is called in both C and C++ mode for simple
token or token class references. In C++, you can override the ANTLRParser member
functions _match_wdfltsig() and _setmatch_wdfltsig().
The ‘@’ operator may also be placed at the start of any alternative to indicate that all token ref-
erences in that alternative (and enclosed subrules) are to behave as if they had been suffixed
with the ‘@’ operator individually.
[‘@’ was the only punctuation symbol available for this operator; there is no predefined
relationship between the operation and the ‘@’ symbol that we are taking advantage of--in fact, it
clashes with the lexical end-of-file regular expression symbol.]
2.8 ANTLR Command Line Arguments
ANTLR understands the following command line arguments:
-CC Generate C++ output from ANTLR.
-ck n Use up to n symbols of lookahead when using compressed (linear approxima-
tion) lookahead. This type of lookahead is very cheap to compute and is
attempted before full LL(k) lookahead, which is of exponential complexity in the
worst case. In general, the compressed lookahead can be much deeper (e.g., -ck
10) than the full lookahead (which usually must be less than 4).
-cr Generate a cross-reference for all rules. For each rule, print a list of all other rules
that reference it.
-e1 Ambiguities/errors shown in low detail (default).
-e2 Ambiguities/errors shown in more detail.
-e3 Ambiguities/errors shown in excruciating detail.
-fe f file Rename err.c to f.
-fh f file Rename stdpccts.h header (turns on -gh) to f.
-fl f file Rename lexical output, parser.dlg, to f.
-fm f file Rename file with lexical mode definitions, mode.h, to f.
-fr f file Rename file which remaps globally visible symbols, remap.h, to f.
-ft f file Rename tokens.h to f.
-gc Indicates that antlr should generate no C code, i.e., only perform analysis on the
grammar.
-gd C/C++ code is inserted in each of the ANTLR generated parsing functions to pro-
vide for user-defined handling of a detailed parse trace. The inserted code con-
sists of calls to the user-supplied macros or functions called zzTRACEIN and
zzTRACEOUT in C and calls to ANTLRParser::tracein() and trace-
out() in C++. The only argument is a char * pointing to a C-style string
which is the grammar rule recognized by the current parsing function. If no defi-
nition is given for the trace functions, upon rule entry and exit, a message will be
printed indicating that a particular rule has been entered or exited.
-ge Generate an error class for each rule.
-gh
Generate stdpccts.h for non-ANTLR-generated files to include. This file
contains all defines needed to describe the type of parser generated by antlr (e.g.
how much lookahead is used and whether or not trees are constructed) and con-
tains the header action specified by the user. If your main() is in another file,
you should include this file in C mode. C++ can ignore this option.
-gk Generate parsers that delay lookahead fetches until needed. Without this option,
ANTLR generates parsers which always have k tokens of lookahead available.
This option is incompatible with semantic predicates and renders references to
LA(i) invalid as one never knows when the ith token of lookahead will be
fetched. [This is currently broken in C++ mode.]
-gl Generate line info about grammar actions in the generated C/C++ code of the
form
# line “file.g”
which makes error messages from the C/C++ compiler make more sense as they
will point into the grammar file not the resulting C/C++ file. Debugging is easier
as well, because you will step through the grammar not C/C++ file.
-gs Do not generate sets for token expression lists; instead generate a “||”-separated
sequence of LA(1)==token_number. The default is to generate sets.
-gt Generate code for Abstract-Syntax Trees.
-gx Do not create the lexical analyzer files (dlg-related). This option should be given
when the user wishes to provide a customized lexical analyzer. It may also be
used in make scripts to cause only the parser to be rebuilt when a change not
affecting the lexical structure is made to the input grammars.
-k n Set k of LL(k) to n; i.e., set the number of tokens of look-ahead (default==1).
-o dir Directory where output files should go (default=”.”). This is very nice for keeping
the source directory clear of ANTLR and DLG spawn.
-p The complete grammar, collected from all input grammar files and stripped of all
comments and embedded actions, is listed to stdout. This is intended to aid in
viewing the entire grammar as a whole and to eliminate the need to keep actions
concisely stated so that the grammar is easier to read.
-pa This option is the same as -p except that the output is annotated with the first sets
determined from grammar analysis.
-prc on Turn on the computation of predicate context (default is not to compute the con-
text).
-prc off Turn off the computation and hoisting of predicate context (default case).
-rl n
Limit the maximum number of tree nodes used by grammar analysis to n. Occa-
sionally, ANTLR is unable to analyze a grammar submitted by the user. This rare
situation occurs when the grammar is large and the amount of lookahead is
greater than one. A nonlinear analysis algorithm is used by PCCTS to handle the
general case of LL(k) parsing. The average complexity of analysis, however, is
near linear due to some fancy footwork in the implementation which reduces the
number of calls to the full LL(k) algorithm. An error message will be displayed,
if this limit is reached, which indicates the grammar construct being analyzed
when ANTLR hit a nonlinearity. Use this option if ANTLR seems to go out to
lunch and your disk starts thrashing; try n=80000 to start. Once the offending construct
has been identified, try to remove the ambiguity that antlr was trying to
overcome with large lookahead analysis. The introduction of (...)? backtrack-
ing predicates eliminates some of these problems--antlr does not analyze alterna-
tives that begin with (...)? (it simply backtracks, if necessary, at run time).
-w1 Set low warning level. Do not warn if semantic predicates and/or (...)? blocks
are assumed to cover ambiguous alternatives.
-w2 Ambiguous parsing decisions yield warnings even if semantic predicates or (...)?
blocks are used. Warn if predicate context computed and semantic predicates
incompletely disambiguate alternative productions.
- Read grammar from standard input and generate stdin.c as the parser file.
2.9 DLG Command Line Arguments
These are the command line arguments understood by DLG (normally, you can ignore these
and concentrate on ANTLR):
-CC Generate C++ output. The output file is not specified in this case.
-Clevel Where level is the compression level used. 0 indicates no compression, 1
removes all unused characters from the transition table, and 2 maps equivalent
characters into the same character classes. It is suggested that level -C2 is
used, since it will significantly reduce the size of the DFA produced for lexical
analyzer.
-m f Produces the header file for the lexical mode with a name other than the default
name of mode.h.
-i An interactive, or as interactive as possible, scanner is produced. A character is
only obtained when required to decide which state to go to. Some care must be
taken to obtain accept states that do not require look ahead at the next character to
determine if that is the stop state. Any regular expression with an “e*” at the end
is guaranteed to require another character of lookahead.
-cl class Specify a class name for DLG to generate. The default is DLGLexer.
-ci The DFA will treat upper and lower case characters identically. This is accom-
plished in the automaton; the characters in the lexical buffer are unmodified.
-cs Upper and lower case characters are treated as distinct. This is the default.
-o dir
Directory where output files should go (default=”.”). This is very nice for keeping
the source directory clear of ANTLR and DLG spawn.
-Wambiguity Warns if more than one regular expression could match the same character
sequence. The warnings give the numbers of the expressions in the DLG lexical
specification file. The numbering of the expressions starts at one. Multiple warnings
may be printed for the same expressions.
-
Used in place of file names to get input from stdin or send output to stdout.
2.10 C Interface
[The C interface was not designed--it gradually evolved from a simplistic attributed-parser
built as a course project written in 1988. Unfortunately, for backward compatibility reasons,
the interface has been augmented but not changed in any significant way. Readers looking at
this description should take this fact into consideration.]
The C interface parsing model assumes that a scanner (normally built by DLG) returns the
token type of tokens found on the input stream when it is asked to do so by the parser. The
parser provides attributes, which are computed from the token type and text of the token, to
grammar actions to facilitate translations. The line and column information are directly
accessed from the scanner. The interface requires only that you define what an attribute looks
like and how to construct one from the information provided by the scanner. Given this infor-
mation, ANTLR can generate a parser that will correctly compile and recognize sentences in
the prescribed language.
The type of an attribute must be called Attrib; the function or macro to convert the text and
token type of a token to an attribute is called zzcr_attr().
This chapter describes the invocation of C interface parsers, the special symbols and functions
available to grammatical actions, the definition of attributes, and the definition of AST nodes.
2.10.1 Invocation of C Interface Parsers
C interface parsers are invoked via one of the macros defined in Table 12.
The rule argument must be a valid C function call, including any parameters required by the
starting rule. For example, to read an expr from stdin:
ANTLR(expr(), stdin);
To read an expr from string buf:
char buf[] = "3+4";
ANTLRs(expr(), buf);
To read an expr and build an AST:
char buf[] = "3+4";
AST *root;
ANTLRs(expr(&root), buf);
To read an expr and build an AST where expr has a single integer parameter:
#define INITIAL 0
char buf[] = "3+4";
AST *root;
ANTLRs(expr(&root,INITIAL), buf);
A simple template for a C interface parser is the following:
#header <<
#include "charbuf.h"
>>
#token "[\ \t]+" <<zzskip();>>
#token "\n" <<zzskip(); zzline++;>>
<<
main() { ANTLR(start(), stdin); }
>>
start : ;
TABLE 12. C Interface Parser Invocation Macros
Macro            Description
ANTLR(r,f)       Begin parsing at rule r, reading characters from stream f.
ANTLRm(r,f,m)    Begin parsing at rule r, reading characters from stream f; begin in
                 lexical class m.
ANTLRf(r,f)      Begin parsing at rule r, reading characters by calling function f for
                 each character.
ANTLRs(r,s)      Begin parsing at rule r, reading characters from string s.
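The three kinds of character source behind these macros (stream, string, and function) can be modeled with a single function-pointer interface. The C sketch below is only an illustration of that idea and says nothing about how DLG implements it; CharSource, the source_from_* constructors, read_all(), and abc_source() are all hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

typedef struct CharSource CharSource;
struct CharSource {
    int (*next)(CharSource *self);  /* returns the next character or EOF */
    FILE *stream;                   /* used by the stream source */
    const char *str;                /* used by the string source */
    int (*func)(void);              /* used by the function source */
};

static int next_from_stream(CharSource *s) { return getc(s->stream); }
static int next_from_string(CharSource *s)
{
    return *s->str != '\0' ? (unsigned char)*s->str++ : EOF;
}
static int next_from_func(CharSource *s) { return s->func(); }

CharSource source_from_stream(FILE *f)
{
    CharSource s = { next_from_stream, NULL, NULL, NULL };
    assert(f != NULL);
    s.stream = f;
    return s;
}

CharSource source_from_string(const char *text)
{
    CharSource s = { next_from_string, NULL, NULL, NULL };
    assert(text != NULL);
    s.str = text;
    return s;
}

CharSource source_from_func(int (*f)(void))
{
    CharSource s = { next_from_func, NULL, NULL, NULL };
    assert(f != NULL);
    s.func = f;
    return s;
}

/* Example character-supplying function: yields 'a', 'b', 'c', then EOF. */
int abc_source(void)
{
    static const char text[] = "abc";
    static int i = 0;
    return text[i] != '\0' ? text[i++] : EOF;
}

/* Stand-in for a scanner loop: copy characters into buf until EOF or
   until the buffer is full. Returns the number of characters read. */
int read_all(CharSource *s, char *buf, int cap)
{
    int n = 0, c;
    while (n < cap - 1 && (c = s->next(s)) != EOF)
        buf[n++] = (char)c;
    buf[n] = '\0';
    return n;
}
```

A scanner written against such an interface does not need to know which of the three sources it is reading from, which is the same separation the four invocation macros provide.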
2.10.2 Functions and Symbols in Lexical Actions
Table 13 describes the functions and symbols available to actions that are executed
upon the recognition of an input token (however, in rare cases, these functions need to be called
from within a grammar action).
TABLE 13. C Interface Symbols Available to Lexical Actions
Symbol                    Description
zzreplchar(char c)        Replace the text of the most recently matched lexical object
                          with c.
zzreplstr(char *s)        Replace the text of the most recently matched lexical object
                          with s.
int zzline                The current line number being scanned by DLG. This variable must
                          be maintained by the user; this variable is normally maintained
                          by incrementing it upon matching a newline character.
zzmore()                  This function merely sets a flag that tells DLG to continue
                          looking for another token; future characters are appended to
                          zzlextext.
zzskip()                  This function merely sets a flag that tells DLG to continue
                          looking for another token; future characters are not appended
                          to zzlextext.
zzadvance()               Instruct DLG to consume another input character. zzchar will be
                          set to this next character.
int zzchar                The most recently scanned character.
char *zzlextext           The entire lexical buffer containing all characters matched thus
                          far since the last token type was returned. See zzmore() and
                          zzskip().
NLA                       To change token type to t, do “NLA = t;”. This feature is not
                          really needed anymore as semantic predicates are a more elegant
                          means of altering the parse with run time information.
NLATEXT                   To change token type text to foo, do “strcpy(NLATEXT,foo);”.
                          This feature sets the actual token lookahead buffer, not the
                          lexical buffer zzlextext.
char *zzbegexpr           Beginning of last token matched.
char *zzendexpr           End of last token matched.
ZZCOL                     Define this preprocessor symbol to get DLG to track the column
                          numbers.
int zzbegcol              The column number starting from 1 of the first character of the
                          most recently matched token.
int zzendcol              The column number starting from 1 of the last character of the
                          most recently matched token. Reset zzendcol to 0 when a newline
                          is encountered. Adjust zzendcol in the lexical action when a
                          character is not one print position wide (e.g. tabs or
                          non-printing characters). The column information is not
                          immediately updated if a token's action calls zzmore().
void (*zzerr)(char *)     You can set zzerr to point to a routine of your choosing to
                          handle lexical errors (e.g., when the input does not match any
                          regular expression).
zzmode(int m)             Set the lexical mode (i.e., lexical class or automaton)
                          corresponding to a lex class defined in an ANTLR grammar with
                          the #lexclass directive. Yes, this is very poorly named.
int zzauto                What automaton (i.e., lexical mode) is DLG in?
zzrdstream(FILE *)        Specify that the scanner should read characters from the stream
                          argument.
zzclose_stream()          Close the current stream.
zzrdstr(zzchar_t *)       Specify that the scanner should read characters from the string
                          argument.
zzrdfunc(int (*)())       Specify that the scanner should obtain characters by calling the
                          indicated function.
zzsave_dlg_state(
  struct zzdlg_state *)   Save the current state of the scanner. This is useful for
                          scanning nested include files, etc.
zzrestore_dlg_state(
  struct zzdlg_state *)   Restore the state of the scanner from a state buffer.
2.10.3 Attributes Using the C Interface
Attributes are objects that are associated with all tokens found on the input stream. Typically,
attributes represent the text of the input token, but may include any information that you
require. The type of an attribute is specified via the Attrib type name, which you must provide.
A function zzcr_attr() is also provided by you to inform the parser how to convert
from the token type and text of a token to an Attrib. [In early versions of ANTLR, attributes
were also used to pass information to and from rules or subrules. Rule arguments and return
values are a more sophisticated mechanism and, hence, in this section, we will pretend as if
attributes are only used to communicate with the scanner.]
2.10.3.1 Attribute Definition and Creation
The attributes associated with input tokens must be a function of the text and the token type
associated with that lexical object. These values are passed to zzcr_attr() which computes
the attribute to be associated with that token. The user must define a function or macro
that has the following form:
void zzcr_attr(attr, type, text)
Attrib *attr; /* pointer to attribute associated with this lexeme */
int type; /* the token type of the token */
char *text; /* text associated with lexeme */
{
/* *attr = f(text,token); */
}
Consider the following Attrib and zzcr_attr() definition.
typedef union {
int ival; float fval;
} Attrib;
void zzcr_attr(Attrib *attr, int type, char *text)
{
switch ( type ) {
case INT : attr->ival = atoi(text); break;
case FLOAT : attr->fval = atof(text); break;
}
}
The typedef specifies that attributes are integer or floating point values. When the regular
expression for a floating point number (which has been identified as FLOAT) is matched on the
input, zzcr_attr() converts the string of characters representing that number to a C
float. Integers are handled analogously.
The programmer specifies the C definition or #includes the file needed to define Attrib
(and zzcr_attr() if it is a macro) using the ANTLR #header directive. The action associated
with #header is placed in every C file generated from the grammar file(s). Any C file
created by the user that includes antlr.h must once again define Attrib before #include’ing
antlr.h. A convenient way to handle this is to use the -gh ANTLR command line
option to have ANTLR generate the stdpccts.h file and then simply include stdpccts.h.
2.10.3.2 Attribute References
Attributes are referenced in user actions as $label where label is the label of a token refer-
enced anywhere before the position of the action. For example,
#header <<
typedef int Attrib;
#define zzcr_attr(attr, type, text) *attr = atoi(text);
>>
#token "[\ \t\n]+" <<zzskip();>> /* ignore whitespace */
add : a:"[0-9]+" "\+" b:"[0-9]+"
<<printf("addition is %d\n", $a+$b);>>
;
If Attrib is defined to be a structure or union, then $label.field is used to access the
various fields. For example, using the union example above,
#header <<
typedef union { ... };
>>
void zzcr_attr(...) { ... };
#token "[\ \t\n]+" <<zzskip();>> /* ignore whitespace */
add : a:INT "\+" b:FLOAT
<<printf("addition is %f\n", $a.ival+$b.fval);>>
;
For backward compatibility reasons, ANTLR still supports the notation $i and $i.j where i
and j are positive integers. The integers uniquely identify an element within the currently
active block and within the current alternative of that block. With the invocation of each new
block, a new set of attributes becomes active and the previously active set is temporarily inac-
tive. The $i and $i.j style attributes are scoped exactly like local stack-based variables in C.
Attributes are stored and accessed in stack fashion. With the recognition of each element in a
rule, a new attribute is pushed on the stack. Consider the following simple rule:
a: B | C ;
Rule a has 2 alternatives. $i refers to the ith rule element in the current block and within the
same alternative. So, in rule a, both B and C are $1.
Subrules are like code blocks in C--a new scope exists within the subrule. The subrules
themselves are counted as a single element in the enclosing alternative. For example,
b : A ( B C <<action1>> | D E <<action2>> ) F <<action3>>
| G <<action4>>
;
The following table describes the attributes that are visible to each action.
TABLE 14. Visibility and Scoping of Attributes
Action     Visible Attributes
action1    B as $1 (or $2.1), C as $2 (or $2.2), A as $1.1
action2    D as $1, E as $2, A as $1.1
action3    A as $1, F as $3
action4    G as $1
2.10.3.3 Attribute Destruction
The user may elect to “destroy” all attributes created with zzcr_attr(). A macro or function
called zzd_attr(), is executed once for every attribute when the attribute goes out of
scope. Deletions are done collectively at the end of every block. zzd_attr() is passed the
address of the attribute to destroy. This can be useful when memory is allocated with
zzcr_attr() and needs to be free()ed. For example, sometimes zzcr_attr() needs
to make copies of some lexical objects temporarily. Rather than explicitly inserting code into
the grammar to free these copies, zzd_attr() can be used to do it implicitly. This concept is
similar to the constructors and destructors of C++.
Consider the case when attributes are character strings and copies of the lexical text buffer are
made which later need to be deallocated. This can be accomplished with code similar to the
following.
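#header <<
#include <stdlib.h>
#include <string.h>
typedef char *Attrib;
#define zzd_attr(attr) {free(*(attr));}
>>
<<
void zzcr_attr(Attrib *attr, int type, char *text)
{
/* attributes that own no copy are set to NULL; free(NULL) is harmless */
*attr = NULL;
if ( type == StringLiteral ) {
*attr = malloc( strlen(text)+1 );
if ( *attr != NULL ) strcpy(*attr, text);
}
}
>>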
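The pairing of creation and collective destruction can be exercised outside a grammar with a stand-alone C sketch. The zzcr_attr() and zzd_attr() below are free-standing functions written for this illustration, not the ones compiled into an ANTLR parser; the Block structure, block_match(), block_exit(), the live_copies counter, and the token types are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef char *Attrib;
enum { StringLiteral = 1, Other };   /* hypothetical token types */

static int live_copies = 0;          /* bookkeeping for this sketch only */

/* Create an attribute: string literals get a private copy of the text. */
void zzcr_attr(Attrib *attr, int type, char *text)
{
    *attr = NULL;
    if (type == StringLiteral) {
        *attr = malloc(strlen(text) + 1);
        if (*attr != NULL) { strcpy(*attr, text); live_copies++; }
    }
}

/* Destroy an attribute: release the copy, if one was made. */
void zzd_attr(Attrib *attr)
{
    if (*attr != NULL) { free(*attr); *attr = NULL; live_copies--; }
}

#define MAX_ATTRS 8
typedef struct { Attrib attrs[MAX_ATTRS]; int count; } Block;

/* Called as each token of a block is matched: push a new attribute. */
void block_match(Block *b, int type, char *text)
{
    assert(b->count < MAX_ATTRS);
    zzcr_attr(&b->attrs[b->count++], type, text);
}

/* Called at the end of the block: destroy its attributes collectively. */
void block_exit(Block *b)
{
    while (b->count > 0)
        zzd_attr(&b->attrs[--b->count]);
}

int live_attribute_copies(void) { return live_copies; }
```

After block_exit() the count of live copies returns to its value before the block was entered, which is the property that lets the grammar omit explicit free() calls.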
2.10.3.4 Standard Attribute Definitions
Some typical attribute types are defined in the PCCTS include directory. These standard
attribute types are contained in the following include files:
•