PLCC: A Programming Language Compiler Compiler
Timothy Fossum
Computer Science Department
SUNY College at Potsdam
Potsdam, NY 13676
fossumtv@potsdam.edu
ABSTRACT
This paper describes PLCC, a compiler-compiler tool to sup-
port courses in programming languages, compilers, and com-
putational theory. This tool has proven to be useful for
implementing interpreters, building compilers, and creating
parsers for context-free languages.
PLCC is a Perl program that takes an input file that speci-
fies the tokens, syntax, and semantics of a language and that
generates a complete set of Java files that implement the se-
mantics of the language. PLCC stands for “Programming
Language Compiler-Compiler”.
PLCC is not intended to be a production-quality tool.
Rather, it supports understanding and implementing the es-
sential elements of lexical analysis, parsing, and semantics
without having to wrestle with the complexities of dealing
with “industrial-strength” compiler-compiler tools. Students
quickly learn how to write PLCC “grammar” files for small
languages that have straightforward syntax and semantics
and use PLCC to build Java-based parsers, interpreters, or
compilers for these languages that run out-of-the-box.
Input to PLCC is a text file with a token definition section
that defines language tokens as simple regular expressions, a
syntax section that specifies the grammar rules of an LL(1)
language as simple Backus-Naur Form (BNF) productions,
and a semantics section that defines the language semantics
as Java methods.
PLCC generates a set of Java source files that are entirely
self-contained and that import only standard elements of
java.util in JDK5 and above. For testing purposes, PLCC
generates a read-eval-print loop that (1) reads standard in-
put, (2) scans, parses, and evaluates the input, and (3) prints
the evaluation to standard output.
Categories and Subject Descriptors
D.3.4 [Software]: Programming Languages—processors
SIGCSE’14, March 3–8, 2014, Atlanta, GA, USA.
Copyright 2014 ACM 978-1-4503-2605-6/14/03 ...$15.00.
http://dx.doi.org/10.1145/2538862.2538922.
General Terms
Languages
Keywords
Compiler-compiler, parser, interpreter, syntax, semantics
1. INTRODUCTION
My decision to write PLCC was inspired by the book Es-
sentials of Programming Languages (EOPL) by Friedman,
Wand, and Haynes[4]. I had used EOPL for more than ten
years to teach an upper-level course in Programming Lan-
guages. Since Java is the principal language our students
take in introductory courses, I found that the time required
for them to learn Scheme, the implementation language used
in EOPL, ate into the time I wanted to spend on other topics
in my Programming Languages class.
What I liked about EOPL was the emphasis on writing
simple interpreters for languages that grow incrementally
in complexity as the course progresses. Many of these lan-
guages are purely functional, giving students much-needed
exposure to a different programming paradigm. My ITiCSE
paper[2] describes how I use the EOPL approach to define
a language with classes as first class objects, which is based
on material in Chapter 5 (Objects and Classes) of EOPL.
EOPL comes with a Scheme tool set called sllgen, a
parser-generator written in Scheme. Input to sllgen con-
sists of specifications (represented as Scheme lists) of a lan-
guage’s lexical structure and grammar. Sllgen uses the
lexical specification to generate a scanner for the language
and uses the grammar specification (which must be LL(1)[1,
Chapter 5]) to generate a set of Scheme “datatypes” that im-
plement the elements of a parse tree for a program in the
language. The sllgen-generated recursive descent parser
returns an instance of the datatype for the start symbol of
the grammar; this instance is the root of the parse tree for
the input program in the language. An sllgen-generated
read-eval-print loop reads a program in the language, to-
kenizes and parses it, evaluates the resulting parse tree based
on semantics provided by the implementer in an eval pro-
cedure, and prints the result.
The “datatypes” generated by sllgen are akin to Pas-
cal’s variant records[6]. When two or more grammar rules
have the same left-hand-side nonterminal, the sllgen parser
chooses the correct grammar rule to apply based on the cur-
rent token (using the LL(1) properties of the language) and
returns an instance of the nonterminal’s datatype that iden-
tifies what grammar rule was used in the parse. Scheme
procedures like eval that perform semantic actions on these
datatypes must use a cases construct – like a switch in Java
or C – to identify the particular instance of the datatype
and to carry out an appropriate semantic action on that in-
stance. Writing eval code using a cases construct can be
messy. Each of the grammar rules having the same left-
hand-side nonterminal requires a cases entry to code the
eval behavior of a particular variant of the datatype. This
makes coding cumbersome, and it doesn’t facilitate a clean
physical or conceptual separation of responsibilities to han-
dle the eval semantics that the different right-hand-sides
represent.
I observed that processing the variants of a datatype could
be accomplished through dynamic dispatch in an object-
oriented language. Instead of using a cases construct to
eval a Scheme datatype representing a node in a parse tree,
I could use inheritance to dispatch an eval method call on
the Java object that represents the particular node in the
parse tree. Dynamic dispatch has the advantage that the
eval code for the object representing a node in the parse
tree node is unique to the object.
I also wanted to represent the lexical and grammar speci-
fications of a language in a way that is more straightforward
than using Scheme lists. In particular, I wanted to use regu-
lar expressions[3] to define the tokens in the language and to
use conventional Backus-Naur Form (BNF) style for specify-
ing the grammar rules. Finally, I wanted the entire language
specification to be monolithic, with the lexical, grammar,
and semantic specifications all in one file.
I liked being able to implement the increasingly complex
sequence of languages as given in the EOPL book. I wanted
to ensure that the PLCC project would allow me to process
these languages as I had done using Scheme and sllgen.
My goals in this project were thus to:
- Target Java as the implementation language
- Use a single file to specify the language’s lexical, grammar, and semantic structure
- Use regular expressions to specify the lexical structure of a language
- Use BNF to define the grammar rules of a language
- Generate recursive-descent parsing code that is easy to read and understand
- Make a clear separation of semantics from syntax
- Make use of dynamic dispatch to implement semantic actions corresponding to different grammar rules having the same left-hand-side nonterminal
- Be capable of handling language examples in EOPL
- Use standard Perl and Java to implement the project
After building PLCC and using it in my Programming
Languages course based on EOPL, I found that PLCC would
also serve as a framework for teaching a course on compilers.
For my Compilers course, I replace the simple PLCC token
scanner based on regular expressions with a separately built
token scanner based on a finite-state machine (also using
an object-oriented approach) that can handle, for example,
comments that cross line boundaries. When used to gener-
ate a compiler, PLCC handles target code generation in the
same way it handles evaluation semantics for an interpreter,
except that the “value” of a program in the source language
is a generated program in the target language (I use assem-
bly language), and the read-eval-print loop is replaced by a
read-parse-generate method.
A bare-bones PLCC specification without any evaluation
semantics (other than returning “pass” or “fail”) can be used
to generate parsers for many context-free languages encoun-
tered in a Computational Theory class. It’s easy to con-
struct a simple specification file for a language with a small
number of tokens and grammar rules that PLCC can turn
into a read-eval-print loop to check for membership in the
language.
2. THE PLCC TOOL
2.1 Lexical specifications
The EOPL sllgen toolkit defines the lexical specification
of a language in terms of a Scheme list. The following is
a lexical specification in sllgen that skips whitespace and
comments (beginning with the ‘%’ character) and that iden-
tifies patterns for variables (var) and numeric literals (lit):
(define the-lexical-spec
  '((whitespace
      (whitespace) skip)
    (comment
      ("%" (arbno (not #\newline))) skip)
    (var
      (letter (arbno (or letter digit))) var)
    (lit
      (digit (arbno digit)) lit)))
In PLCC, I represent these patterns as regular expres-
sions. Here is the same lexical specification written in
PLCC, which I regard as more straightforward. This ap-
proach also gives students the opportunity to learn about
regular expressions, which can prove to be useful to them as
they pursue a computing-related career.
skip WHITESPACE '\s+'
skip COMMENT '%.*'
LIT '\d+'
VAR '[a-zA-Z]\w*'
PLCC assumes that tokens do not cross line boundaries.
This means that a pattern such as ‘%.*’ will terminate at the
end of the current line. This assumption makes token pro-
cessing simpler, but it does mean that the PLCC-generated
scanner cannot handle multi-line skips or tokens. To al-
low for more general lexical analysis behavior, PLCC can
be directed to skip generating the scanner-related classes,
allowing the user to provide an externally written scanner.
As noted above, I use this in my compiler class, where we
spend time implementing a stand-alone scanner.
Sllgen gathers reserved words from the grammar rules, so
the following sllgen grammar rule would result in defining
tokens corresponding to ‘if’, ‘then’, and ‘else’:
(exp
("if" exp "then" exp "else" exp)
if-exp)
In PLCC, these tokens need to be defined explicitly in the
lexical specification section:
IF 'if'
THEN 'then'
ELSE 'else'
The PLCC-generated scanner skips input that matches
the lexical skip definitions. It then returns the next input
token by examining all of the token definitions, in the order
given in the specification file, and determining which ones
match one or more characters of the current input. The
scanner returns the token with the longest match: among
matches of the same length, it returns the token correspond-
ing to the first definition it encounters. This means that the
lexeme ‘if’ would be returned as an IF token, not a VAR
token, with the following token specifications:
skip WHITESPACE '\s+'
skip COMMENT '%.*'
IF 'if'
THEN 'then'
ELSE 'else'
LIT '\d+'
VAR '[a-zA-Z]\w*'
When I ask students to add a new reserved word such
as ‘while’ to the token specifications, some will add it after
the VAR specification, so ‘while’ would be regarded as a VAR
instead of a WHILE: both patterns match the string ‘while’,
and both matches have the same length, but VAR comes
first. This provides a good learning opportunity: a test pro-
gram that should work with a while will not parse properly
if while is treated as a VAR. Some students who make this
error ask why their solutions don’t work, and they have an
“aha” experience when they discover – often with my help
– why not, and how to fix it by simply moving a line in
the token specifications. Others who submit incorrect solu-
tions without even writing test programs are greeted with a
pointed query asking if they tested their solutions.
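To make the matching discipline concrete, here is a small
self-contained Java sketch of longest-match selection with
first-definition tie-breaking. It only illustrates the rule just
described and is not PLCC's actual Scan code; the TokenDef
class and the pick method are hypothetical names.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of longest-match selection with first-definition
// tie-breaking; hypothetical names, not PLCC's actual Scan code.
public class MatchDemo {
    static class TokenDef {
        final String name;     // e.g. "IF", "VAR"
        final Pattern pattern; // compiled from the spec's regular expression
        TokenDef(String name, String regex) {
            this.name = name;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Return the name of the winning token at position pos, or null.
    static String pick(List<TokenDef> defs, String line, int pos) {
        String best = null;
        int bestLen = 0;
        for (TokenDef def : defs) { // specification order
            Matcher m = def.pattern.matcher(line).region(pos, line.length());
            // lookingAt() anchors the match at the start of the region
            if (m.lookingAt() && m.end() - pos > bestLen) {
                best = def.name;         // a strictly longer match wins;
                bestLen = m.end() - pos; // ties keep the earlier definition
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<TokenDef> defs = new ArrayList<TokenDef>();
        defs.add(new TokenDef("IF", "if"));
        defs.add(new TokenDef("VAR", "[a-zA-Z]\\w*"));
        System.out.println(pick(defs, "if", 0));   // IF  (tie: IF is defined first)
        System.out.println(pick(defs, "iffy", 0)); // VAR (longer match wins)
    }
}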
2.2 Grammar Rules
Sllgen grammar rules are given as a Scheme list. Here is
an example subset of grammar rules for expressions in one
of the EOPL languages:
(exp (lit) lit-exp)
(exp (var) var-exp)
(exp
("if" exp "then" exp "else" exp)
if-exp)
(exp
("let" (arbno var "=" exp) "in" exp)
let-exp)
(exp
("proc" "(" (separated-list var ",") ")" exp)
proc-exp)
The first two grammar rules above can be written in PLCC
as follows:
<exp>:LitExp ::= <LIT>
<exp>:VarExp ::= <VAR>
The arbno construct in sllgen corresponds to a “Kleene
star” repetition construct in Extended BNF. PLCC does not
allow a Kleene star operator to be used in the middle of
the right-hand-side of a grammar rule. Instead, a separate
grammar rule construct, introduced by ‘**=’, must be used
to identify a rule where the entire right-hand-side can appear
zero or more times. For example, the (arbno var "=" exp)
construct can be written as a separate rule in PLCC as fol-
lows:
<letDecls> **= <VAR> EQUALS <exp>
Armed with this, the sllgen grammar let-exp rule can be
written in PLCC as shown here:
<exp>:LetExp ::= LET <letDecls> IN <exp>
<letDecls> **= <VAR> EQUALS <exp>
The separated-list construct in sllgen identifies a rule
where items can appear zero or more times – similar to
arbno – but the items must be separated by a specified to-
ken if there are more than one of them. This is used, in
proc-exp for example, when specifying function parameters
as a comma-separated list:
(separated-list var ",")
PLCC uses ‘**=’ to introduce such a construct, with the
separator token appearing at the end, preceded by a ‘+’ sign:
<formals> **= <VAR> +COMMA
A proc-exp in PLCC will then become
<exp>:ProcExp ::= PROC
LPAREN <formals> RPAREN <exp>
<formals> **= <VAR> +COMMA
(Note that the first two lines above appear folded to fit the
column width. In the PLCC file, these would appear on one
line.)
2.3 Classes generated from grammar rules
PLCC generates a Java class for each left-hand-side non-
terminal given in the grammar rules. Nonterminals in the
grammar rules section must begin with a lower-case letter
and can be followed by any number of additional letters,
digits, or underscores. The name of the Java class gener-
ated by PLCC is the same as the name of the nonterminal,
except with its first letter converted to uppercase. From
the nonterminals <exp>, <letDecls>, and <formals>, PLCC
generates the classes Exp, LetDecls, and Formals.
When a language has two or more grammar rules with
the same left-hand-side nonterminal, the left-hand-side non-
terminal class is declared as abstract, and subclasses are
created for each of these grammar rules. The name of a
particular subclass is specified in the grammar rule by the
class name that follows a colon ‘:’ after the nonterminal.
For example, the following grammar rules
<exp>:LitExp ::= <LIT>
<exp>:VarExp ::= <VAR>
define classes named LitExp and VarExp that extend the ab-
stract class Exp. PLCC ensures that there is a one-to-one
correspondence between the generated non-abstract class
names and the grammar rules.
2.4 Class fields
As described above, every grammar rule line defines a
unique non-abstract Java class. Each such class has a num-
ber of public fields corresponding to the items on the right-
hand-side of the rule. Only those right-hand side items that
appear in angle brackets ‘<...>’ have fields defined in the
class. If the item is a nonterminal such as <exp>, the corre-
sponding field is named exp and has type Exp. If the item is
a terminal such as <VAR>, the corresponding field is named
var and has type Token. The class has a single constructor
that assigns its arguments to these fields.
For example, from the grammar rule
<exp>:LetExp ::= LET <letDecls> IN <exp>
PLCC generates a LetExp class having two fields, a con-
structor, and a static parse method:
// <exp>:LetExp ::= LET <letDecls> IN <exp>
public class LetExp extends Exp {
public LetDecls letDecls;
public Exp exp;
public LetExp(LetDecls letDecls, Exp exp) {
this.letDecls = letDecls;
this.exp = exp;
}
public static LetExp parse(Scan scn) {
scn.match(Token.Val.LET);
LetDecls letDecls = LetDecls.parse(scn);
scn.match(Token.Val.IN);
Exp exp = Exp.parse(scn);
return new LetExp(letDecls, exp);
}
...
}
The static LetExp.parse method returns an instance of
this class by processing, in order, the items in the right-hand-
side of the grammar rule: matching the terminal LET, calling
the LetDecls.parse method, matching the terminal IN, and
calling the Exp.parse method. The resulting letDecls and
exp values are used to construct and return a LetExp object.
PLCC declares its generated class fields to be public.
While this practice can be considered unsafe, it makes cod-
ing semantic methods more straightforward.
A Java class generated by a grammar rule with repetitions
– one that uses the **= repetition construct – can have zero
or more field instances corresponding to items on its right-
hand-side. For such a class, the values are collected into
ArrayList fields whose names are the same as for a non-
repeating rule, with the string ‘List’ appended. For exam-
ple, the following grammar rule
<letDecls> **= <VAR> EQUALS <exp>
generates a Java class named LetDecls having a field
named varList of type ArrayList<Token> and a field named
expList of type ArrayList<Exp>. A successful call to the
parse method on this class returns an instance of a LetDecls
object: the varList will be populated with a number of
Token objects, and the expList will be populated with the
same number of Exp objects. Similar remarks apply to re-
peating rules with a separator.
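To make this concrete, a generated LetDecls class might
look roughly as follows. This is a hedged reconstruction in
the style of the LetExp example above, not verbatim PLCC
output: in particular, the scn.cur() peek and a Token-
returning scn.match are assumptions about the generated
Scan and parse code.

import java.util.ArrayList;

// Hedged reconstruction of the class generated from
//   <letDecls> **= <VAR> EQUALS <exp>
// Assumes scn.cur() peeks at the current token and scn.match(...)
// returns the Token it consumed (assumptions, not verbatim output).
public class LetDecls {
    public ArrayList<Token> varList; // one entry per repetition
    public ArrayList<Exp> expList;   // same length as varList

    public LetDecls(ArrayList<Token> varList, ArrayList<Exp> expList) {
        this.varList = varList;
        this.expList = expList;
    }

    public static LetDecls parse(Scan scn) {
        ArrayList<Token> varList = new ArrayList<Token>();
        ArrayList<Exp> expList = new ArrayList<Exp>();
        // Loop while the current token can start another repetition
        // (an LL(1) decision based on the rule's right-hand-side).
        while (scn.cur().val == Token.Val.VAR) {
            varList.add(scn.match(Token.Val.VAR));
            scn.match(Token.Val.EQUALS);
            expList.add(Exp.parse(scn));
        }
        return new LetDecls(varList, expList);
    }
}

For a separated-list rule such as <formals> **= <VAR> +COMMA,
the corresponding loop would additionally consume the COMMA
separator between repetitions and stop when it is absent.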
Both of the repeating rules can be replaced with suitably
chosen recursive grammar rules: PLCC makes such replace-
ments internally, but only to check for LL(1). However,
there are significant advantages to using repeating rules:
- The resulting parse trees for grammars using repeating rules are shallower than those using recursive grammar rules, since the ArrayLists flatten out the parse tree.
- The parse methods for a class defined by a repeating rule can employ a loop instead of recursive calls, resulting in run-time improvements in space and time. The same remark applies to methods that implement repeating rule semantics.
- Repeating rules are easier to read and understand than their recursive counterparts: there are fewer productions, and it’s easy to spot the repeating parts.
- The ArrayList fields in a repeating rule conveniently package the repeating elements of a parse in a way that is more direct and compact than spreading them out in the parse tree. Processing these ArrayLists to carry out semantic actions is straightforward using iterators.
2.5 Parsing
Parsing a program that conforms to a PLCC grammar
specification is easy: call the parse method on the class
generated by the start symbol of the language, which is al-
ways the first left-hand-side nonterminal appearing in the
grammar rules. The parse method is static in all of the
PLCC-generated classes. Each parse method is passed a
Scan object (see Section 3 below) that delivers tokens for
parsing. For an abstract class such as Exp, the parse method
returns a parsed instance of one of its subclasses: the current
input token determines the appropriate grammar rule to ap-
ply (based on the LL(1) property of the grammar) which in
turn determines the appropriate subclass to instantiate. For
a non-abstract class such as LetDecls, the parse method
returns an instance of the class itself.
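For example, for the two <exp> rules shown earlier, the
dispatching parse method in the abstract Exp class might
look roughly like this sketch (again assuming a scn.cur()
peek; the actual generated code is not reproduced here):

// Hedged sketch of LL(1) dispatch in the abstract class generated from
//   <exp>:LitExp ::= <LIT>
//   <exp>:VarExp ::= <VAR>
public abstract class Exp {
    public static Exp parse(Scan scn) {
        // Peek at the current token to select the unique applicable rule.
        switch (scn.cur().val) { // scn.cur() is an assumed peek accessor
        case LIT:
            return LitExp.parse(scn);
        case VAR:
            return VarExp.parse(scn);
        default:
            throw new RuntimeException("syntax error: cannot parse <exp>");
        }
    }
}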
When the top-level parse completes successfully (excep-
tions can occur when there is a syntax error or when the
Scan class is unable to deliver a token) the result is an in-
stance of the Java class corresponding to the start symbol.
In many EOPL languages, this is an instance of the Program
class.
The following Java code represents the essence of PLCC-
generated scanning and parsing, returning an instance of the
Program class, the class associated with the start symbol
<program>:
Program.parse(new Scan(System.in));
2.6 Semantics
The PLCC-generated read-eval-print loop parses a pro-
gram in the language by calling the static parse method on
the Java class generated by the start symbol of the language.
This is done before any semantics are applied to the result-
ing parse. Thus the entire program is scanned and parsed
prior to carrying out any semantic actions.
The default PLCC semantics of a program is to print the
Java String value of the parse. In other words, the entirety
of the default PLCC lexical analysis, parsing, and seman-
tics is embodied in the following one-liner (folded for your
viewing pleasure):
System.out.println(
Program.parse(
new Scan(System.in)
)
);
In the absence of overriding the toString method of a
Program object, the above statement will produce something
like
Program@768965fb
which simply says that the default semantics of this program
is a Program object.
In its simplest form, implementing the non-default seman-
tics of a PLCC language consists of overriding the default
toString method in the start symbol class. Here is a gram-
mar fragment (the tokens specifications are similar to those
given above) for a simple language:
<program> ::= <exp>
<exp>:LitExp ::= <LIT>
<exp>:VarExp ::= <VAR>
To override the default toString behavior in the Program
class, we create a code fragment that defines a toString
method and associate it with the Program class as follows:
Program
%%{
public String toString() {
return exp.eval();
}
%%}
The first line, beginning with Program, identifies the Java
class whose source code will be modified, and the items be-
tween the lines %%{ and %%} will be added to the code already
in the Program.java source file. (In the above example, re-
call that exp is a field in the Program class that is populated
by its parse method. The eval method will be defined in
the subclasses of the Exp class.) This code, and all other
code that defines the language semantics of the grammar
rules, appears in the PLCC language specification file fol-
lowing the grammar rules for the language, after a line with
a single ‘%’.
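Assembled, the resulting Program.java might look roughly
like the following reconstruction, combining the generated
stub (field, constructor, and parse method, in the style of
the LetExp example) with the inserted toString method:

// Plausible merged result (a reconstruction, not verbatim PLCC output)
// for the rule <program> ::= <exp> plus the semantics block above.
public class Program {
    public Exp exp;

    public Program(Exp exp) {
        this.exp = exp;
    }

    public static Program parse(Scan scn) {
        Exp exp = Exp.parse(scn);
        return new Program(exp);
    }

    // inserted from the semantics section between %%{ and %%}
    public String toString() {
        return exp.eval();
    }
}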
An entire Java class source file can be created in the se-
mantics section by naming the class (as long as it is different
from the PLCC-generated classes) and including the code to
be inserted into the class between %%{ and %%} lines. This
is useful when Java classes other than those automatically
generated by PLCC are needed to implement language se-
mantics.
Continuing the example above, the eval semantics for a
LitExp might be given as follows:
LitExp
%%{
public String eval() {
return lit.toString();
}
%%}
In this example, lit is the only field in the LitExp class:
it has type Token, and evaluating lit.toString() returns
the string value of the token as it appears in the program
source.
2.7 Putting it all together
The three sections of a PLCC specification – lexical, gram-
mar, and semantics – appear in one file, with the sections
separated by a line with a single ‘%’ sign. Comments can ap-
pear in the PLCC lexical and grammar specifications start-
ing with a ‘#’ and continuing to the end of the line.
Here is a complete specification example. The evaluation
semantics of a numeric literal is the literal itself, and the
evaluation semantics of a variable symbol is the uppercase
version of the symbol.
# a simple language with numeric literals
# and variable symbols
# lexical specification
skip WHITESPACE '\s+'
skip COMMENT '%.*'
LIT '\d+'
VAR '[a-zA-Z]\w*'
%
# grammar rules
<program> ::= <exp>
<exp>:LitExp ::= <LIT>
<exp>:VarExp ::= <VAR>
%
# semantics
Program
%%{
public String toString() {
return exp.eval();
}
%%}
Exp
%%{
public abstract String eval();
%%}
LitExp
%%{
// return the literal string
public String eval() {
return lit.toString();
}
%%}
VarExp
%%{
// return the symbol in uppercase
public String eval() {
return var.toString().toUpperCase();
}
%%}
2.8 Read-eval-print
PLCC automatically generates a read-eval-print Rep class
whose main method repeatedly prints a prompt, creates an
instance of the Scan class from System.in, and parses and
prints the grammar start symbol class. A sample interac-
tion running the Java Rep program using the above language
specification looks as follows:
--> 42
42
--> xyZzY % should print XYZZY
XYZZY
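A minimal sketch of such a Rep class is shown below; the
actual PLCC template is not reproduced in this paper, and
the end-of-input handling here is an assumption:

// Hedged sketch of the generated read-eval-print class; the actual
// Rep template is not reproduced in this paper.
public class Rep {
    public static void main(String[] args) {
        Scan scn = new Scan(System.in);
        while (true) {
            System.out.print("--> ");      // prompt, as in the sample run
            Program prog;
            try {
                prog = Program.parse(scn); // scan and parse one program
            } catch (Exception e) {
                break; // end of input (or an unrecovered syntax error)
            }
            System.out.println(prog);      // print: invokes toString()
        }
    }
}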
3. PLCC ARCHITECTURE
I chose Perl to write PLCC because Perl has good string
handling and pattern matching capabilities. The output of
PLCC is a set of Java programs, all of which are text files
and are generated in a subdirectory of the current directory
named Java.
PLCC reads the token specifications and creates a file
named Token.java that contains enum entries for both skip
tokens and normal tokens. PLCC generates these entries
from the regular expressions given in the token specifica-
tions, turning them into Java Strings with appropriate es-
caping. PLCC uses a template file called Token.pattern
as the basis for filling in the appropriate token definitions
drawn from the specification file to create the Token.java
file. The end of the token specification is a line with a single
%.
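For the specification in Section 2.7, the generated Token.java
might look roughly like this sketch; the field names and the
skip flag are assumptions based on the description above,
not verbatim PLCC output:

// Hedged sketch of a generated Token.java; names are assumptions.
public class Token {
    public enum Val {
        // skip entries and regular tokens, with their escaped patterns
        WHITESPACE("\\s+", true),
        COMMENT("%.*", true),
        LIT("\\d+", false),
        VAR("[a-zA-Z]\\w*", false);

        public final String pattern; // regex from the specification file
        public final boolean skip;   // true for skip entries

        Val(String pattern, boolean skip) {
            this.pattern = pattern;
            this.skip = skip;
        }
    }

    public Val val;    // which token this is
    public String str; // the matched lexeme

    public Token(Val val, String str) {
        this.val = val;
        this.str = str;
    }

    public String toString() {
        return str; // so lit.toString() yields the source lexeme
    }
}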
PLCC then reads the grammar rules and creates entries
for each class, noting which classes must be abstract (be-
cause the nonterminal appears more than once on the left-
hand-side of the grammar rules), and keeping track of the
corresponding right-hand-sides. It checks the grammar for
being LL(1) and reports an error if not. Each repetition
grammar rule (with **=) is turned into multiple rules that
use recursion instead of repetition, but only for the purpose
of checking for LL(1).
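For illustration, a repeating rule such as

<letDecls> **= <VAR> EQUALS <exp>

can be checked as if it had been written as the right-recursive
pair

<letDecls> ::= <VAR> EQUALS <exp> <letDecls>
<letDecls> ::=

where the second rule has an empty right-hand-side. (This is
a conceptual illustration; the exact rules PLCC constructs
internally are not shown here.)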
Once the grammar is determined to be LL(1), PLCC gen-
erates class stubs for each of the grammar rules, as well as
a Rep.java file to implement the read-eval-print loop. The
class stubs include generated code for the parse methods
specific to the grammar rules.
The Rep.java file is built from a template that only needs
to have the name of the start symbol class filled in. The end
of the grammar section is a line with a single %.
PLCC then reads the semantics specification entries, each
of which starts with a class name followed by lines sand-
wiched between lines containing %%{ and %%}. If the class
name is one of the stubbed classes, the given lines are in-
serted into the stub class verbatim. If the class name is not
one of the stub classes, PLCC creates a new class containing
the given lines.
The Scan.java program is part of the PLCC standard
code library and is automatically included in the generated
code directory. Using the generated Token class, the Scan
class does all of the dirty work of reading lines of the in-
put stream (a BufferedReader), skipping over input that
matches the skip specifications, and returning the next to-
ken.
All of the Java source files created by PLCC are deposited
into a subdirectory named Java. Once these Java files are
generated and compiled (errors in creating the semantic rou-
tines can be uncovered here), the Rep program can be run
to test the resulting language implementation.
In order to make it easier to deal with the separate parts of
the semantics of a PLCC specification, an include directive
can be used in the semantics specification section, giving the
name of the file to include in the specification. The contents
of this file then become part of the PLCC specification input.
This is useful, for example, to separate code that implements
semantics specific to grammar classes from code that is used
to implement auxiliary classes used in semantic actions. For
particularly complex languages, it may be useful to have
several of these files.
4. COMPARISON WITH OTHER PARSER
GENERATORS
A Wikipedia comparison of parser generators[7] for de-
terministic context-free languages lists about 90 entries.
Parsers for non-LL(1) languages are typically table-driven
or backtracking, and the code generation for such languages
is much more difficult for students to read and understand
than the code for simple LL(1) predictive parsers (as gener-
ated by PLCC) based on recursive descent. Of the entries in
this list, 12 of them clearly target languages that are LL(1)
or that generate recursive descent parsers. Of these, only
Coco/R[5] targets Java (the latest version also targets C#
and C++).
However, Coco/R mixes syntax and semantics: seman-
tic actions are specified in-line with the grammar rules, so
Coco/R does not satisfy my goal of clearly separating syn-
tax and semantics. Furthermore, the format of specifying
Coco/R grammar rules is more complex than the simple
BNF-style used in PLCC. Getting started using Coco/R is
more difficult and time-consuming than learning how to use
PLCC. I conclude that PLCC meets my course-related goals
in a way that no other parser-generator does.
5. COURSE CONSIDERATIONS
I have used PLCC in two offerings of a Programming Lan-
guages course at my institution, and I plan to continue its
use when I offer this course in the future. I have also used
PLCC as the compiler-compiler for an offering of a Compil-
ers class.
I have observed that my students learn how to use PLCC
quickly, more so than when I was using sllgen and needed to
cover the elements of Scheme from scratch. I have been able
to cover more and richer language examples once I started
using PLCC. Students coming into my Programming Lan-
guages class have experience with Java, but they generally
do not have a good understanding of abstract classes: using
PLCC gives them the opportunity to become familiar with
abstract classes and how to use them.
6. PLCC AVAILABILITY
The entire PLCC toolkit consists of the plcc Perl program
(1320 lines) and a set of template files used to generate the
scanner and read-eval-print loop (a total of 371 lines). These
can all be downloaded from http://cs.potsdam.edu/PLCC.
7. REFERENCES
[1] C. Fischer and R. LeBlanc Jr. Crafting a Compiler with
C. Benjamin/Cummings, Redwood City, CA, 1991.
[2] T. Fossum. Classes as first-class objects in an
environment-passing interpreter. In Proceedings of the
Tenth Annual Conference on Innovation and
Technology in Computer Science Education
(Lisbon, Portugal), pages 261–265. ACM, 2005.
[3] J. Friedl. Mastering Regular Expressions. O’Reilly
Media, Sebastapol, CA, 2006.
[4] D. Friedman, M. Wand, and C. Haynes. Essentials of
Programming Languages (2nd ed). The MIT Press,
Cambridge, Massachusetts, 2001.
[5] H. Mössenböck. A generator for production quality
compilers. In Springer Verlag Lecture Notes in
Computer Science, 477:42–55, 1990.
[6] Pascal: ISO Standard 7185. 1990. Retrieved December
2, 2013 from
http://pascal-central.com/docs/iso7185.pdf.
[7] Wikipedia.Org. 2013. Comparison of Parser Generators.
Retrieved September 5, 2013 from
http://en.wikipedia.org/wiki/.