Martin Sulzmann
Compiler pipeline:
Syntax Analysis ---> Semantic Analysis ---> Optimization --> Codegen
Ex1: x + * z
syntax error
Ex2: x + true
semantic error: can't mix Int and Bool
Ex3: f(1) + f(1), where f has type Int -> Int
all good; an optimizer may rewrite this to let x = f(1) in x + x (common subexpression elimination)
Codegen emits code for some target (Native, JVM, ...)
Matching versus parsing.
Matching: Check if the program is syntactically valid.
Parsing: Match and transform the program (=string) into a more suitable representation for further analysis (AST = Abstract Syntax Tree).
Parsing is largely a solved problem.
There exist many parsing tools (e.g. yacc and ANTLR).
The state-of-the-art tool ANTLR is based on the idea of top-down parsing.
ANTLR requires the user to specify the syntax (to be parsed) in terms of a form of context-free grammars (CFGs).
From the CFG, ANTLR produces a parser (a Java program) that then parses the prescribed syntax. Hence, we say ANTLR is a parser generator.
We do not consider a specific parsing tool. Rather, we study the main idea behind top-down parsing.
Transform CFG rules into a suitable form. This may not always be possible.
Turn CFG rules in suitable form into (Go) parsing functions.
The last step from CFG rules to Go is done by hand. Tools such as yacc and ANTLR automate this step.
Use combinators to build a parser (parser combinators)
Combinators are fancy API (parsing) functions which hide much of the plumbing necessary to carry out parsing.
Like in case of a handwritten parser, we might need to bring CFG rules into a suitable form so that we can apply parser combinators.
There exist lots of parser combinator libraries for most programming languages. We will implement our own parser combinator library in Go.
Compared to parsing tools such as yacc and ANTLR, the parser is embedded into the host language. So, parser combinators form an internal domain-specific language (DSL) (aka embedded DSL) whereas yacc and ANTLR are external DSLs.
Parser combinators are also a good example of
playing with higher-order functions,
an embedded (internal) domain-specific language (DSL), and
separating a DSL from its host language via types.
GPL = general-purpose language (C, Java, Kotlin, Haskell, Go, Python, …)
DSL = domain-specific language (SQL, Latex, bash, perl, …)
DSLs are usually external languages. They come with their own syntax, interpreter, compiler, … (“eco-system”).
Internal DSLs (a.k.a. embedded DSLs, a.k.a. APIs or frameworks) make use of the host language (GPL) to provide functionality similar to an external DSL. They are often easier to evolve and adapt and can reuse the existing eco-system. On the other hand, they might not be as optimized as an external DSL, and further issues arise, e.g. domain-specific error handling.
How to describe and verify that a sequence of symbols consisting of
digits, +, * etc. is a syntactically
well-formed arithmetic expression?
Examples.
1, 2, ...
1 + 2
(1 + 3) * 2
...
Let’s be more formal.
E is an expression if and only if (iff)
E is a number
E is of the form E1+E2 or E1*E2 where E1 and E2 are expressions.
E is of the form (E1) where E1 is an expression.
Observation. We can use (context-free grammar) rules to describe expressions.
E -> 0
E -> 1
...
...
E -> E + E
E -> E * E
E -> (E)
We refer to E -> E + E as a CFG rule.
Each rule consists of a left-hand side and a right-hand side (separated by ->).
The symbols appearing on some left-hand side are non-terminal symbols. Symbols that only appear on right-hand sides are terminal symbols. The left-hand side consists of exactly one non-terminal symbol whereas the right-hand side may consist of an arbitrary sequence of non-terminal and terminal symbols. The right-hand side may also be (the) empty (string).
Rules are applied by replacing (reducing) the left-hand side by the
right-hand side. For each CFG, there is a designated starting symbol.
For our example, we assume the starting symbol E.
A derivation is a sequence of reduction steps where we
assume that we always start with the starting symbol E.
A word is a sequence of terminal symbols. For example,
consider 1 + 1 + 0.
A word is part of the language described by the CFG if we find a derivation starting with E that ends with the word.
Consider the following derivation.
E
-> E + E (by applying the CFG rule E -> E + E)
-> E * E + E (by applying the CFG rule E -> E * E on the 'first' E in 'E + E')
-> E * E + 0 (by applying the CFG rule E -> 0 on the 'third' E in 'E * E + E')
-> 1 * E + 0 (and so on)
-> 1 * 1 + 0
We find that 1 * 1 + 0 is a word in the language
described by the above CFG.
The rules for numbers are tedious.
E -> N
N -> 0
...
N -> 9
N -> 0N
...
N -> 9N
This could be much simplified by using regular expressions.
E -> ['0'-'9']+
where we make use of the following short-hands.
r+ = r r*.
['0'-'9'] is a short-form for
'0'|...|'9'.
There are a number of further short-hands and notations to make context-free grammars more accessible. See Extended Backus–Naur form.
We loosely follow EBNF. For brevity, we only consider the numbers 0, 1 and 2.
N ::= 0 | 1 | 2
E ::= N | (E) | E + E | E * E
N and E are non-terminal symbols and 0, 1, 2, (, ), + and *
are terminal symbols. The above EBNF rules are simply a more
compact representation of the CFG rules we have used so far. For
example, in EBNF we may combine the right-hand sides of rules via
| (alternative) if the left-hand sides are the same.
The EBNF grammar is ambiguous. Ambiguous means here that for the same word, we find two structurally different derivations (two distinct parse trees) using the above rules.
(1)
E
-> E + E
-> N + E
-> 1 + E
-> 1 + E * E
-> 1 + N * E
-> 1 + 2 * E
-> 1 + 2 * N
-> 1 + 2 * 1
(2)
E
-> E * E
-> E + E * E
-> 1 + E * E
-> 1 + 2 * E
-> 1 + 2 * 1
What is the consequence of ambiguity? In case of (1), we interpret “1
+ 2 * 1” as 1 + (2 * 1). In case of (2), we interpret “1 +
2 * 1” as (1 + 2) * 1.
Unambiguity of a grammar is therefore important. Otherwise, compiler A (Oracle) might follow the interpretation (1) and compiler B (Microsoft) follows the interpretation (2). The same program will be interpreted differently!
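To make the consequence concrete, here is a small self-contained Go sketch. It anticipates the AST types (Num, Plus, Mult) and the eval method introduced later in these notes. For the word 1 + 2 * 1 both readings happen to evaluate to 3, so we use 1 + 2 * 0, where the two readings disagree:

```go
package main

import "fmt"

// Minimal AST sketch, anticipating the representation used later in these notes.
type Exp interface{ eval() int }

type Num int
type Plus [2]Exp
type Mult [2]Exp

func (x Num) eval() int  { return int(x) }
func (e Plus) eval() int { return e[0].eval() + e[1].eval() }
func (e Mult) eval() int { return e[0].eval() * e[1].eval() }

func main() {
	// Reading (1): 1 + (2 * 0), assuming * binds tighter.
	a := Plus{Num(1), Mult{Num(2), Num(0)}}
	// Reading (2): (1 + 2) * 0.
	b := Mult{Plus{Num(1), Num(2)}, Num(0)}
	fmt.Println(a.eval(), b.eval()) // 1 0: the same word, two different results
}
```

The same word yields 1 under reading (1) and 0 under reading (2), which is exactly the compiler-A-versus-compiler-B problem described above.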
How to resolve the ambiguity for our example? In general, we assume
that * binds tighter than +. This can be
enforced via the following grammar.
E ::= E + T | T
T ::= T * F | F
F ::= N | (E)
Recall the above example.
E
-> E + T
-> E + T * F
-> T + T * F
-> F + T * F
-> F + F * F
-> N + F * F
-> 1 + F * F
-> 1 + N * F
-> 1 + 2 * F
-> 1 + 2 * N
-> 1 + 2 * 1
Interpretation is 1 + (2 * 1).
A derivation, sequence of CFG rule applications, can be represented
as a parse tree. Derivations are formal proofs to verify for
example that 1 + (2 * 1) is word in the language denoted by
our running CFG.
Parse trees are compact representations for such proofs and are more suitable compared to derivations for processing in a programming language.
From derivations to parse trees:
Each rule consists of a left-hand side (non-terminal) and a right-hand side (mix of terminals and non-terminals).
We view each non-terminal as an intermediate node in a tree. Each terminal is a leaf node.
The left-hand side non-terminal is therefore the parent node. The children are found on the right-hand side.
Consider
E ::= E + T
viewed as a tree
E
/ | \
E + T
Consider the derivation.
E
-> T
-> F
-> (E)
-> (E + T)
-> (T + T)
-> (F + T)
-> (F + F)
-> (N + F)
-> (N + N)
-> (1 + 0)
Represented as a parse tree.
E
|
T
|
F
/ | \
( E )
/ | \
E + T
| |
T F
| |
F N
| |
N 0
|
1
The above parse tree represents the concrete syntax where all details of the original expression remain visible. We refer to the above as a concrete parse tree.
In further transformation steps, not all (concrete) syntactic details are necessary. For example, intermediate steps like “E reduced to T” are not relevant. Furthermore, we can ignore parentheses as they are (implicitly) represented by the structure of the tree.
The thus simplified tree is referred to as the abstract parse tree (aka abstract syntax tree = AST) and captures the abstract syntax of expressions.
+
/ \
1 0
In the AST representation we furthermore label intermediate nodes with terminal symbols such as the binary operators (+, *).
We make use of algebraic data types to represent ASTs. For our expression language, ASTs need to support the following cases:
Numbers
Booleans
Binary operations for multiplication, addition, conjunction and disjunction.
Go does not directly support algebraic data types and pattern matching over algebraic data types. However, both features can be emulated via interfaces/methods.
We introduce the interface

type Exp interface {
    pretty() string
}

We only assume a pretty (print) method. This method transforms an AST expression back into an expression using the concrete syntax.
The AST cases are represented via named types and (overloaded)
definitions of pretty.
We only consider the two cases of numbers and multiplication.
type Num int
type Mult [2]Exp
func (x Num) pretty() string {
return strconv.Itoa(int(x))
}
func (e Mult) pretty() string {
var x string
x = "("
x += e[0].pretty()
x += "*"
x += e[1].pretty()
x += ")"
return x
}

The remaining cases are straightforward.
Parentheses (see pretty) are necessary! The pretty
printed string shall match the AST.
Without parentheses pretty applied on
// See example ex1 below
*
/ \
+ 0
/ \
1 2
yields "1+2*0". Assuming that * binds
tighter than +, the AST obtained from "1+2*0"
has the form
+
/ \
1 *
/ \
2 0
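To see the round-trip point in action, here is a self-contained sketch using the Num, Plus and Mult types with the pretty methods from the listing below. With parentheses, the two ASTs above print to distinct strings:

```go
package main

import (
	"fmt"
	"strconv"
)

// AST types and pretty methods, as in the full listing that follows.
type Exp interface{ pretty() string }

type Num int
type Plus [2]Exp
type Mult [2]Exp

func (x Num) pretty() string  { return strconv.Itoa(int(x)) }
func (e Plus) pretty() string { return "(" + e[0].pretty() + "+" + e[1].pretty() + ")" }
func (e Mult) pretty() string { return "(" + e[0].pretty() + "*" + e[1].pretty() + ")" }

func main() {
	ex1 := Mult{Plus{Num(1), Num(2)}, Num(0)} // the first tree above
	ex2 := Plus{Num(1), Mult{Num(2), Num(0)}} // the AST obtained from "1+2*0"
	fmt.Println(ex1.pretty()) // ((1+2)*0)
	fmt.Println(ex2.pretty()) // (1+(2*0))
}
```

Without the parentheses emitted by pretty, both ASTs would print as "1+2*0" and the distinction would be lost.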
package main
import "fmt"
import "strconv"
// Simple expression language
// We only consider the AST representation.
// Interface
type Exp interface {
pretty() string
}
// Cases
type Bool bool
type Num int
type Mult [2]Exp
type Plus [2]Exp
type And [2]Exp
type Or [2]Exp
// pretty print
func (x Bool) pretty() string {
if x {
return "true"
} else {
return "false"
}
}
func (x Num) pretty() string {
return strconv.Itoa(int(x))
}
func (e Mult) pretty() string {
var x string
x = "("
x += e[0].pretty()
x += "*"
x += e[1].pretty()
x += ")"
return x
}
func (e Plus) pretty() string {
var x string
x = "("
x += e[0].pretty()
x += "+"
x += e[1].pretty()
x += ")"
return x
}
func (e And) pretty() string {
var x string
x = "("
x += e[0].pretty()
x += "&&"
x += e[1].pretty()
x += ")"
return x
}
func (e Or) pretty() string {
var x string
x = "("
x += e[0].pretty()
x += "||"
x += e[1].pretty()
x += ")"
return x
}
// Helper functions to build ASTs by hand
func number(x int) Exp {
return Num(x)
}
func boolean(x bool) Exp {
return Bool(x)
}
func plus(x, y Exp) Exp {
return (Plus)([2]Exp{x, y})
// The type Plus is defined as the two element array consisting of Exp elements.
// Plus and [2]Exp are isomorphic but different types.
// We first build the AST value [2]Exp{x,y}.
// Then cast this value (of type [2]Exp) into a value of type Plus.
}
func mult(x, y Exp) Exp {
return (Mult)([2]Exp{x, y})
}
func and(x, y Exp) Exp {
return (And)([2]Exp{x, y})
}
func or(x, y Exp) Exp {
return (Or)([2]Exp{x, y})
}
// Examples
func run(e Exp) {
fmt.Printf("\n ******* ")
fmt.Printf("\n %s", e.pretty())
}
func ex1a() {
ast := plus(mult(number(1), number(2)), number(0))
run(ast)
}
func ex1b() {
ast := mult(plus(number(1), number(2)), number(0))
run(ast)
}
func ex2() {
ast := and(boolean(false), number(0))
run(ast)
}
func ex3() {
ast := or(boolean(false), number(0))
run(ast)
}
func main() {
fmt.Printf("\n")
ex1a()
ex1b()
ex2()
ex3()
}

The parsing problem: given some set of CFG rules and some word, check if the word is part of the language denoted by the CFG. If yes, produce a parse tree.
For arbitrary CFGs, see the Cocke–Younger–Kasami algorithm (cubic complexity).
Top-down parsing is deterministic and runs in linear time. Generally, it requires rewriting the grammar into a suitable form. This is not always possible. So, not every grammar is suitable for top-down parsing.
Recall
N ::= 0 | 1 | 2
E ::= N | (E) | E + E | E * E
This grammar is not suitable for parsing because the grammar is ambiguous. See above discussion.
The grammar below is unambiguous and accepts the same set of words (language) as the above grammar.
E ::= E + T | T
T ::= T * F | F
F ::= N | (E)
N ::= 0 | 1 | 2
The idea of top-down parsing: starting from the start symbol, we predict, based on the next input symbol, which right-hand side to apply. For the above grammar, this is tricky because of left-recursion.
Consider
E ::= E + T | T
How to predict if we should favor the right-hand side
E + T or the favor the right-hand side T?
We might have to backtrack! This is bad because backtracking commonly implies some exponential run-time.
What to do? The idea is to (yet again) rewrite the grammar and remove the left-recursion. In general, a left-recursive rule of the form A ::= A x | y is rewritten into A ::= y A2 and A2 ::= x A2 | (where the last alternative is the empty right-hand side).
Here is the resulting grammar.
E ::= T E2
E2 ::= + T E2 |
T ::= F T2
T2 ::= * F T2 |
F ::= N | (E)
N ::= 0 | 1 | 2
How to derive a top-down parser based on this grammar? Each left-hand side corresponds to a function where the function body is defined by the right-hand side. We explain this idea in the following.
Each non-terminal implies a function of the same name.
Successfully executing the function implies that we have recognized a word that is derivable via this non-terminal.
The function body is defined by the right-hand side of the grammar rule.
Consider
E ::= T E2
The above translates to

E() {
    T();
    E2();
}

We assume here that the functions T and E2 only accept words that can be derived via the respective non-terminals.
Consider
T2 ::= * F T2 |
In this case there are two alternative right-hand sides. The first
case (* F T2) implies that we first need to check that the
leading character is *. To carry out this check, we make
the following assumptions.
token: a global variable that refers to the current symbol/character
getNext(): a function which assigns the next symbol/character to token
Here is the translation of the above rule T2 ::= * F T2 | :

T2() {
    if token = *
    then getNext(); F(); T2(); return;
    else return;
}
If the leading symbol is * we call
getNext(), followed by F() and
T2(). Otherwise, we simply return (this corresponds to the
“epsilon” right-hand side).
Consider
F ::= N | (E)
N ::= 0 | 1 | 2
The above translates to
F() {
    if token in {0, 1, 2}
    then getNext(); return;
    else if token = (
         then getNext();
              E();
              if token = )
              then getNext(); return;
              else ABORT;
         else ABORT;
}

We effectively unfold the non-terminal N. Hence, we can immediately see that token must equal one of the cases 0, 1, 2 or (.
Below we give a complete implementation of a top-down parser in Go where we apply the above translation scheme.
We give a top-down matcher for our expression language. Instead of a parser (that produces an AST), we only consider a matcher (that says yes/no if the expression is syntactically correct or not).
package main
import "fmt"
// Simple scanner/lexer
// Tokens
const (
EOS = 0 // End of string
ZERO = 1
ONE = 2
TWO = 3
OPEN = 4
CLOSE = 5
PLUS = 6
MULT = 7
)
func scan(s string) (string, int) {
for {
switch {
case len(s) == 0:
return s, EOS
case s[0] == '0':
return s[1:len(s)], ZERO
case s[0] == '1':
return s[1:len(s)], ONE
case s[0] == '2':
return s[1:len(s)], TWO
case s[0] == '+':
return s[1:len(s)], PLUS
case s[0] == '*':
return s[1:len(s)], MULT
case s[0] == '(':
return s[1:len(s)], OPEN
case s[0] == ')':
return s[1:len(s)], CLOSE
default: // simply skip everything else
s = s[1:len(s)]
}
}
}
type State struct {
s *string
tok int
}
func next(s *State) {
s2, tok := scan(*s.s)
s.s = &s2
s.tok = tok
}
// E ::= T E2
func matchE(s *State) bool {
if !matchT(s) {
return false
}
if !matchE2(s) {
return false
}
return true
}
// E2 ::= + T E2 |
func matchE2(s *State) bool {
if s.tok == PLUS {
next(s)
if !matchT(s) {
return false
}
if !matchE2(s) {
return false
}
}
return true
}
// T ::= F T2
func matchT(s *State) bool {
if !matchF(s) {
return false
}
if !matchT2(s) {
return false
}
return true
}
// T2 ::= * F T2 |
func matchT2(s *State) bool {
if s.tok == MULT {
next(s)
if !matchF(s) {
return false
}
if !matchT2(s) {
return false
}
}
return true
}
// F ::= N | (E)
func matchF(s *State) bool {
switch {
case s.tok == ZERO:
next(s)
return true
case s.tok == ONE:
next(s)
return true
case s.tok == TWO:
next(s)
return true
case s.tok == OPEN:
next(s)
if !matchE(s) {
return false
}
if s.tok != CLOSE {
return false
}
next(s)
return true
}
return false
}
func match(s string) bool {
st := State{&s, EOS}
next(&st)
b := matchE(&st)
if st.tok == EOS {
return b
}
return false
}
func debug(s string) {
fmt.Printf("%s", s)
}
func testMatcherBad() {
fmt.Printf("\n BAD \n")
fmt.Printf("%t", match(""))
fmt.Printf("%t", match("1 + "))
fmt.Printf("%t", match("1 + *"))
fmt.Printf("%t", match("( 1 "))
fmt.Printf("%t", match(" )"))
fmt.Printf("%t", match("1 )"))
fmt.Printf("%t", match("( 1 ) )"))
}
func testMatcherGood() {
fmt.Printf("\n GOOD \n")
fmt.Printf("%t", match("1"))
fmt.Printf("%t", match("1+0"))
fmt.Printf("%t", match("1 * 2 "))
fmt.Printf("%t", match(" (1) "))
fmt.Printf("%t", match(" (1 * (2)) "))
fmt.Printf("%t", match(" (1 + 2) * 0 "))
}
func main() {
testMatcherBad()
testMatcherGood()
fmt.Printf("\n")
}

We adjust our parser to produce an AST. Instead of bool,
each parsing function reports in addition some AST. There are several
ways to represent ASTs. We choose to use Go’s type language to use an
AST that resembles algebraic data types as for example found in
Haskell.
The parser can be pretty much straightforwardly derived from the matcher. There is one tricky case.
Consider
E ::= T E2
E2 ::= + T E2 |
E returns an AST. The same applies to E2. Consider E2's right-hand side + T E2. To build the AST for the + node, we need to supply the left operand. That is why the parser function for E2 takes as input T's AST and then returns E2's AST. For example, for the input 1+2+0, parseT yields the AST 1; parseE2 then builds Plus(1,2) and finally Plus(Plus(1,2),0), so sums come out left-associative. Details are below.
package main
import "fmt"
import "strconv"
// AST
type Exp interface {
pretty() string
eval() int
}
type Num int
type Mult [2]Exp
type Plus [2]Exp
// pretty print
func (x Num) pretty() string {
return strconv.Itoa(int(x))
}
func (e Mult) pretty() string {
var x string
x = "("
x += e[0].pretty()
x += "*"
x += e[1].pretty()
x += ")"
return x
}
func (e Plus) pretty() string {
var x string
x = "("
x += e[0].pretty()
x += "+"
x += e[1].pretty()
x += ")"
return x
}
// eval
func (x Num) eval() int {
return int(x)
}
func (e Mult) eval() int {
return e[0].eval() * e[1].eval()
}
func (e Plus) eval() int {
return e[0].eval() + e[1].eval()
}
// Simple scanner/lexer
// Tokens
const (
EOS = 0 // End of string
ZERO = 1
ONE = 2
TWO = 3
OPEN = 4
CLOSE = 5
PLUS = 6
MULT = 7
)
func scan(s string) (string, int) {
for {
switch {
case len(s) == 0:
return s, EOS
case s[0] == '0':
return s[1:len(s)], ZERO
case s[0] == '1':
return s[1:len(s)], ONE
case s[0] == '2':
return s[1:len(s)], TWO
case s[0] == '+':
return s[1:len(s)], PLUS
case s[0] == '*':
return s[1:len(s)], MULT
case s[0] == '(':
return s[1:len(s)], OPEN
case s[0] == ')':
return s[1:len(s)], CLOSE
default: // simply skip everything else
s = s[1:len(s)]
}
}
}
type State struct {
s *string
tok int
}
func next(s *State) {
s2, tok := scan(*s.s)
s.s = &s2
s.tok = tok
}
// E ::= T E2
func parseE(s *State) (bool, Exp) {
b, e := parseT(s)
if !b {
return false, e
}
return parseE2(s, e)
}
// E2 ::= + T E2 |
func parseE2(s *State, e Exp) (bool, Exp) {
if s.tok == PLUS {
next(s)
b, f := parseT(s)
if !b {
return false, e
}
t := (Plus)([2]Exp{e, f})
return parseE2(s, t)
}
return true, e
}
// T ::= F T2
func parseT(s *State) (bool, Exp) {
b, e := parseF(s)
if !b {
return false, e
}
return parseT2(s, e)
}
// T2 ::= * F T2 |
func parseT2(s *State, e Exp) (bool, Exp) {
if s.tok == MULT {
next(s)
b, f := parseF(s)
if !b {
return false, e
}
t := (Mult)([2]Exp{e, f})
return parseT2(s, t)
}
return true, e
}
// F ::= N | (E)
func parseF(s *State) (bool, Exp) {
switch {
case s.tok == ZERO:
next(s)
return true, (Num)(0)
case s.tok == ONE:
next(s)
return true, (Num)(1)
case s.tok == TWO:
next(s)
return true, (Num)(2)
case s.tok == OPEN:
next(s)
b, e := parseE(s)
if !b {
return false, e
}
if s.tok != CLOSE {
return false, e
}
next(s)
return true, e
}
return false, (Num)(0)
}
func parse(s string) Exp {
st := State{&s, EOS}
next(&st)
_, e := parseE(&st)
if st.tok == EOS {
return e
}
return (Num)(0) // dummy value
}
func debug(s string) {
fmt.Printf("%s", s)
}
func test(s string) {
e := parse(s)
fmt.Printf("\n %s", e.pretty())
fmt.Printf("\n %d", e.eval())
}
func testParserGood() {
fmt.Printf("\n GOOD \n")
test("1")
test("1+0")
test("1 * 2 ")
test(" (1) ")
test(" (1 * (2)) ")
test(" (1 + 2) * 0 ")
}
// Helper functions to build ASTs by hand
func number(x int) Exp {
return Num(x)
}
func plus(x, y Exp) Exp {
return (Plus)([2]Exp{x, y})
// The type Plus is defined as the two element array consisting of Exp elements.
// Plus and [2]Exp are isomorphic but different types.
// We first build the AST value [2]Exp{x,y}.
// Then cast this value (of type [2]Exp) into a value of type Plus.
}
func mult(x, y Exp) Exp {
return (Mult)([2]Exp{x, y})
}
func examplesAST() {
ast1 := plus(mult(number(1), number(2)), number(0))
fmt.Printf("%s", ast1.pretty())
}
func main() {
examplesAST()
fmt.Printf("\n")
testParserGood()
}

A parser is a function which takes a string and yields a result and the remaining string.

type Parser[T any] func(s string) (T, string, bool)

Here T represents the parsing result (e.g. some abstract syntax tree) and the returned string represents the remaining input. As parsing may fail, we also return a Boolean value to represent either success or failure. We are using here Go with generics.

Application of a parser on some input:

func parse[T any](f Parser[T], s string) (T, string, bool) {
    return f(s)
}
A parser to accept a specific character (item).
func item(x byte) Parser[byte] {
return func(s string) (byte, string, bool) {
if len(s) == 0 {
return 0, s, false
} else if s[0] == x {
return s[0], s[1:len(s)], true
} else {
return 0, s, false
}
}
}

Build a new parser by composition of existing parsers.
func (p1 Parser[T]) alt(p2 Parser[T]) Parser[T] {
return func(s string) (T, string, bool) {
v, rest, o := parse[T](p1, s)
// parse(p1,s) also works.
// The Go compiler is able to infer here the appropriate type instance.
if !o {
return parse[T](p2, s)
} else {
return v, rest, o
}
}
}
func (p1 Parser[T]) seq(p2 func(T) Parser[T]) Parser[T] {
return func(s string) (T, string, bool) {
v, rest, o := parse(p1, s)
if !o {
return v, rest, o
} else {
return parse(p2(v), rest)
}
}
}
func (p1 Parser[T]) conc(p2 Parser[T]) Parser[T] {
return p1.seq(func(v T) Parser[T] { return p2 })
}

alt is left-biased: if the first argument (parser) fails, we try the alternative.
seq runs two parsers in sequence. Recall that as a side product of parsing we usually expect to obtain a parse tree. With parser combinators the result obtained can be arbitrary. Check the type of seq!
conc is a specialized combinator where we simply concatenate the two parsers, ignoring the result obtained from the first parser.
As we largely ignore here the result produced by parsing, we write a matcher function which checks if our parser matches some input string.
func matcher[T any](p Parser[T], s string) bool {
_, rest, o := parse(p, s)
return rest == "" && o
}
ex0 := item('c')
ex1 := item('a').alt(item('b'))
ex2 := ex0.conc(ex1)

Most parser combinator libraries provide further combinators so we can even deal with left-recursive, ambiguous, and even non-context-free grammars. For details see https://en.wikipedia.org/wiki/Parser_combinator.
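Run against matcher, the example parsers ex0, ex1, ex2 behave as follows. This is a self-contained sketch reusing the combinator definitions from the complete listing at the end of this section:

```go
package main

import "fmt"

// Combinator library, as in the complete listing at the end of this section.
type Parser[T any] func(s string) (T, string, bool)

func parse[T any](f Parser[T], s string) (T, string, bool) { return f(s) }

// item accepts exactly the character x.
func item(x byte) Parser[byte] {
	return func(s string) (byte, string, bool) {
		if len(s) == 0 || s[0] != x {
			return 0, s, false
		}
		return s[0], s[1:], true
	}
}

// alt is left-biased: try p1, fall back to p2.
func (p1 Parser[T]) alt(p2 Parser[T]) Parser[T] {
	return func(s string) (T, string, bool) {
		v, rest, o := parse(p1, s)
		if !o {
			return parse(p2, s)
		}
		return v, rest, o
	}
}

// seq runs p1, then the parser computed from p1's result.
func (p1 Parser[T]) seq(p2 func(T) Parser[T]) Parser[T] {
	return func(s string) (T, string, bool) {
		v, rest, o := parse(p1, s)
		if !o {
			return v, rest, o
		}
		return parse(p2(v), rest)
	}
}

// conc runs p1 and p2 in sequence, ignoring p1's result.
func (p1 Parser[T]) conc(p2 Parser[T]) Parser[T] {
	return p1.seq(func(v T) Parser[T] { return p2 })
}

func matcher[T any](p Parser[T], s string) bool {
	_, rest, o := parse(p, s)
	return rest == "" && o
}

func main() {
	ex0 := item('c')
	ex1 := item('a').alt(item('b'))
	ex2 := ex0.conc(ex1)
	fmt.Println(matcher(ex0, "c"))  // true
	fmt.Println(matcher(ex1, "ab")) // false: leftover input "b"
	fmt.Println(matcher(ex2, "cb")) // true: 'c' followed by 'b'
	fmt.Println(matcher(ex2, "c"))  // false: the second parser finds no input
}
```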
In fact, some of these more expressive combinators can be directly encoded in terms of the host language. In the following, we show how to encode repetition (“Kleene star”).
Consider the following CFG productions
D -> 0 | 1
R -> D; | D,R
where we assume that 0, 1, ,
and ; are terminal symbols.
We consider a row of comma-separated 0s and 1s terminated by a semicolon. We wish to sum up all 0s and 1s in such a row.
Here is a parser that accepts 0 or 1.
func zeroOrOne() Parser[byte] {
return item('0').alt(item('1')).apply(func(x byte) byte { return x - 48 })
}

The item parser yields as a result some ASCII code. Hence, we transform the result obtained by item by subtracting 48 (the ASCII code of '0') to obtain the actual numeric value. This is done via the helper method apply.
func (p Parser[T]) apply(f func(T) T) Parser[T] {
return func(s string) (T, string, bool) {
x,s,b := parse(p,s)
return f(x),s,b
}
}

Here is a parser that accepts , (its result is irrelevant, so we map it to 0):

func comma() Parser[byte] {
    return item(',').apply(func(_ byte) byte { return 0 })
}
Here comes the parser for a row of 0s and 1s.
func row(acc byte) Parser[byte] {
return zeroOrOne().seq(func(x byte) Parser[byte] {
return semicolon(acc+x).alt(comma().seq(func(_ byte) Parser[byte] {
return row(acc+x)
}))
})
}
func semicolon(x byte) Parser[byte] {
return item(';').apply(func(_ byte) byte { return x })
}

We accumulate 0s and 1s.
We check for a leading 0 or 1. See zeroOrOne.
Followed by either semicolon (then we are done), or
comma.
In case of comma, we recursively call
row.
To summarize.
Consider the regular expression
a* c
which can also be defined in terms of the following CFG productions
A -> a A | c
Observation:
Our host language Go supports recursive functions.
Encode the Kleene star via a recursive function yielding a parser for the above grammar.
func acceptAsC() Parser[byte] {
    return item('c').alt(item('a').seq(func(v byte) Parser[byte] { return acceptAsC() }))
}

Some points to note.
Symbol c tells us to stop; hence, it is tried first.
Recall that alt is left-biased.
We can NOT write the (more direct) combinator program

func acceptAsC2() Parser[byte] {
    return item('c').alt(item('a').conc(acceptAsC2()))
}

Go is a strictly evaluated language and the above would immediately lead to a stack overflow: the recursive call acceptAsC2() is evaluated before alt and conc are ever applied. Hence, we need to ‘hide’ the recursive call within a (lambda) function.
Aside. In the lazy language Haskell, the more direct encoding is possible because program parts are lazily evaluated (i.e. only when needed).
Parser combinators are a domain-specific language embedded into some host language (here Go).
Our host language Go is strongly typed, so we can statically guarantee that our parsers are “well-formed” (a parser is a composition of existing parsers).
We make use of recursion available in the host language to extend the expressiveness of our parser combinator language (see the encoding of the Kleene star).
DSL (external):
M ::= 'c' | 'a' M
EDSL:
package main
// Parser combinators https://en.wikipedia.org/wiki/Parser_combinator
import "fmt"
type Parser[T any] func(s string) (T, string, bool)
func parse[T any](f Parser[T], s string) (T, string, bool) {
return f(s)
}
func item(x byte) Parser[byte] {
return func(s string) (byte, string, bool) {
if len(s) == 0 {
return 0, s, false
} else if s[0] == x {
return s[0], s[1:len(s)], true
} else {
return 0, s, false
}
}
}
func (p1 Parser[T]) alt(p2 Parser[T]) Parser[T] {
return func(s string) (T, string, bool) {
v, rest, o := parse[T](p1, s)
// parse(p1,s) also works.
// The Go compiler is able to infer here the appropriate type instance.
if !o {
return parse[T](p2, s)
} else {
return v, rest, o
}
}
}
func (p1 Parser[T]) seq(p2 func(T) Parser[T]) Parser[T] {
return func(s string) (T, string, bool) {
v, rest, o := parse(p1, s)
if !o {
return v, rest, o
} else {
return parse(p2(v), rest)
}
}
}
func (p1 Parser[T]) conc(p2 Parser[T]) Parser[T] {
return p1.seq(func(v T) Parser[T] { return p2 })
}
func (p Parser[T]) apply(f func(T) T) Parser[T] {
return func(s string) (T, string, bool) {
x,s,b := parse(p,s)
return f(x),s,b
}
}
func matcher[T any](p Parser[T], s string) bool {
_, rest, o := parse(p, s)
return rest == "" && o
}
// Parsing and summing up a row of comma-separated 0s and 1s terminated by a semicolon.
func zeroOrOne() Parser[byte] {
return item('0').alt(item('1')).apply(func(x byte) byte { return x - 48 })
}
func comma() Parser[byte] {
return item(',').apply(func(_ byte) byte { return 0 })
}
func semicolon(x byte) Parser[byte] {
return item(';').apply(func(_ byte) byte { return x })
}
func row(acc byte) Parser[byte] {
return zeroOrOne().seq(func(x byte) Parser[byte] {
return semicolon(acc+x).alt(comma().seq(func(_ byte) Parser[byte] {
return row(acc+x)
}))
})
}
func testRow() {
run := func(p Parser[byte], s string) {
x,_,b := parse(p,s)
if b {
fmt.Printf("\n %s yields %d", s,x)
} else {
fmt.Printf("\n %s fails",s)
}
}
run(row(0), "1;")
run(row(0), "0;")
run(row(0), "0,1,1,0;")
run(row(0), "0,1,1, 0;")
}
func acceptAsC() Parser[byte] {
return item('c').alt(item('a').seq(func(v byte) Parser[byte] { return acceptAsC() }))
}
func acceptAsC2() Parser[byte] {
return item('c').alt(item('a').conc(acceptAsC2()))
}
func test() {
fmt.Printf("%t", matcher(acceptAsC(), "ac"))
// fmt.Printf("%t", matcher(acceptAsC2(), "ac"))
ex0 := item('c')
fmt.Printf("%t", matcher(ex0, "c"))
fmt.Printf("%t", matcher(ex0, "cd"))
ex1 := item('a').alt(item('b'))
fmt.Printf("%t", matcher(ex1, "a"))
fmt.Printf("%t", matcher(ex1, "b"))
fmt.Printf("%t", matcher(ex1, "c"))
fmt.Printf("%t", matcher(ex1, "ab"))
ex2 := ex0.conc(ex1)
fmt.Printf("%t", matcher(ex2, "cb"))
fmt.Printf("%t", matcher(ex2, "ad"))
}
func main() {
// test()
testRow()
fmt.Printf("\n done")
}

yacc = yet another compiler compiler
lex = lexer (tokenizer) based on regular expressions
bison/flex GNU variants of yacc/lex
yacc/lex require the following inputs
lexer file (.l) defines tokens (based on regular expressions)
parser file (.y) defines CFG rules (must be suitable for bottom-up parsing)
Outputs are C programs that represent the (bottom-up) parser.
// calc2.l
%{
#include "calc2.tab.h"
%}
%%
"+" {return(PLUS);}
"*" {return(TIMES);}
"(" {return(LPAREN);}
")" {return(RPAREN);}
[\n] {return(CR);}
[0-9]+ {yylval=atoi(yytext); return(INT);}
[\t ] ;
%%

// calc2.y
%{
/*
All of the below productions that do not have associated
actions are using the DEFAULT action -- $$ = $1
*/
%}
%token PLUS TIMES INT CR RPAREN LPAREN
%%
lines : lines line
| line
;
line : expr CR {printf("= %d\n",$1); }
;
expr : expr PLUS term {$$ = $1 + $3; }
| term
;
term : term TIMES factor {$$ = $1 * $3; }
| factor
;
factor : LPAREN expr RPAREN {$$ = $2;}
| INT
;
%%

Productions = CFG rules
For each CFG rule there can be a semantic rule
Semantic rules uses attributes (and may make use of C code)
Attributes are connected to AST nodes where the shape of the AST corresponds to the shape of the CFG
Consider the rule expr : expr PLUS term {$$ = $1 + $3; }.
$1 is an attribute connected to expr (on
right-hand side)
$$ is an attribute connected to expr (on
left-hand side)
yacc assumes synthesized attributes. These are attributes that can be computed from the children's attributes. Consider $$ = $1 + $3.
Synthesized attributes float up the AST whereas inherited attributes are pushed down the AST.
Computation of synthesized attributes matches the bottom-up parsing strategy.
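In the Go AST setting used earlier in these notes, synthesized attributes correspond to computing a node's value from the values of its children, which is exactly what the eval methods do. A minimal sketch:

```go
package main

import "fmt"

// Synthesized attributes in the Go AST setting used earlier in these notes:
// a node's value ($$) is computed from its children's values ($1, $3).
type Exp interface{ eval() int }

type Num int
type Plus [2]Exp

func (x Num) eval() int  { return int(x) }
func (e Plus) eval() int { return e[0].eval() + e[1].eval() } // $$ = $1 + $3

func main() {
	e := Plus{Num(1), Plus{Num(2), Num(3)}}
	fmt.Println(e.eval()) // attribute values float up the tree
}
```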
Makefile to build the parser:

# Makefile
NAME = calc2
$(NAME): $(NAME).tab.c lex.yy.c
gcc $(NAME).tab.c lex.yy.c -o $(NAME) -ly -ll
$(NAME).tab.c: $(NAME).y
bison -vd $(NAME).y
lex.yy.c: $(NAME).l
flex $(NAME).l