Martin Sulzmann
Highlights:
Playing with higher-order functions.
Example of an embedded (internal) domain-specific language (DSL).
DSL separated from host language via types.
GPL = general-purpose language (C, Java, Kotlin, Haskell, Go, Python, ...)
DSL = domain-specific language (SQL, Latex, bash, perl, ...)
DSLs are usually external languages. Come with their own syntax, interpreter, compiler, ... ("eco-system").
Internal DSLs (a.k.a. embedded DSLs, a.k.a. APIs or frameworks) make use of the host language (GPL) to provide functionality similar to an external DSL. Often easier to evolve and adapt, and can reuse the existing eco-system. Might not be as optimized as an external DSL. Other issues: domain-specific error handling etc.
There are lots of different parsing approaches.
For example, see LL and LR parsing.
Another popular approach is parser combinators, see https://en.wikipedia.org/wiki/Parser_combinator.
Main idea:
Use combinators to build parsers.
Combinators are fancy API functions which hide much of the plumbing necessary to carry out parsing.
Often, an EBNF specification can be directly translated into some sequence of combinator calls.
There exist parser combinator libraries for most programming languages.
Compared to parsing tools such as yacc and ANTLR, the parser is embedded into the host language. So, parser combinators form an internal domain-specific language (DSL) (aka embedded DSL) whereas yacc and ANTLR are external DSLs.
In the following, we give a (rather naive) encoding of a parser combinator library in Go.
Compiler pipeline:
Syntax Analysis ---> Semantic Analysis ---> Optimization ---> Codegen
Ex1: x + * z. Syntax error.
Ex2: x + true. Semantic error: can't mix Int and Bool.
Ex3: f(1) + f(1), where f is of type Int -> Int. All good; an optimizer may rewrite this to let x = f(1) in x + x, and codegen targets Native, JVM, ...
Our interest: Syntax Analysis.
Check if the program is syntactically valid.
Transform the program (=string) into a more suitable representation for further analysis (AST).
How to describe that an arithmetic expression is syntactically well-formed?
Examples.
1, 2, ...
1 + 2
(1 + 3) * 2
...
Let's be more formal.
E is an expression if and only if (iff)
E is a number
E is of the form E1+E2 or E1*E2, where E1 and E2 are expressions.
E is of the form (E1) where E1 is an expression.
Observation. We can use (context-free grammar) rules to describe expressions.
E -> 0
E -> 1
...
...
E -> E + E
E -> E * E
E -> (E)
The rules for numbers are tedious.
E -> N
N -> 0
...
N -> 9
N -> 0N
...
N -> 9N
Could be much simplified if we use regular expressions.
E -> ['0'-'9']+
where we make use of the following short-hands:
r+ is a short-form for r r*
['0'-'9'] is a short-form for '0'|...|'9'
There are a number of further short-hands and notations to make context-free grammars more accessible. See Extended Backus–Naur form.
We loosely follow EBNF. For brevity, we only consider the numbers 0, 1 and 2.
N ::= 0 | 1 | 2
E ::= N | (E) | E + E | E * E
We refer to N and E as non-terminal symbols and 0, 1, 2, (, ), + and * as terminal symbols.
Anything you notice? The EBNF grammar is ambiguous. Ambiguous means here that for the same word, we find two derivations using the above rules.
(1)
E
-> E + E
-> N + E
-> 1 + E
-> 1 + E * E
-> 1 + N * E
-> 1 + 2 * E
-> 1 + 2 * N
-> 1 + 2 * 1
(2)
E
-> E * E
-> E + E * E
-> 1 + E * E
-> 1 + 2 * E
-> 1 + 2 * 1
What is the consequence of ambiguity? In case of (1), we interpret "1 + 2 * 1" as 1 + (2 * 1). In case of (2), we interpret "1 + 2 * 1" as (1 + 2) * 1.
Unambiguity of a grammar is therefore important. Otherwise, compiler A (Oracle) might follow interpretation (1) and compiler B (Microsoft) interpretation (2). The same program will be interpreted differently!
How to resolve the ambiguity for our example? In general, we assume that * binds tighter than +. This can be enforced via the following grammar.
E ::= E + T | T
T ::= T * F | F
F ::= N | (E)
Recall the above example.
E
-> E + T
-> E + T * F
-> T + T * F
-> F + T * F
-> F + F * F
-> N + F * F
-> 1 + F * F
-> 1 + N * F
-> 1 + 2 * F
-> 1 + 2 * N
-> 1 + 2 * 1
Interpretation is 1 + (2 * 1).
A derivation can be represented as a parse tree.
Idea:
Each rule consists of a left-hand side (non-terminal) and a right-hand side (mix of terminals and non-terminals).
We view each non-terminal as an intermediate node in a tree. Each terminal is a leaf node.
The left-hand side non-terminal is therefore the parent node. The children are found on the right-hand side.
Consider
E ::= E + T
viewed as a tree
  E
/ | \
E + T
Consider the derivation.
E
-> T
-> F
-> (E)
-> (E + T)
-> (T + T)
-> (F + T)
-> (F + F)
-> (N + F)
-> (N + N)
-> (1 + 0)
Represented as a parse tree.
              E
              |
              T
              |
              F
           /  |  \
          (   E   )
            / | \
           E  +  T
           |     |
           T     F
           |     |
           F     N
           |     |
           N     0
           |
           1
The above parse tree represents the concrete syntax where all details of the original expression remain visible. We refer to the above as a concrete parse tree.
In further transformation steps, not all (concrete) syntactic details are necessary. For example, intermediate steps like "E reduced to T" are not relevant. Furthermore, we can ignore parentheses as they are (implicitly) represented by the structure of the tree.
The thus simplified tree is referred to as the abstract parse tree (aka abstract syntax tree = AST) and captures the abstract syntax of expressions.
+
/ \
1 0
In the AST representation, we furthermore turn terminal symbols such as + and * into intermediate (binary) nodes.
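To make the abstract syntax concrete, such an AST could be sketched in Go as follows. The type and constructor names (Exp, Num, Plus, Mult) and the eval function are our own choice for illustration; they are not part of the matcher code below.

```go
package main

import "fmt"

// Hypothetical AST for the expression language E ::= N | (E) | E + E | E * E.
// Parentheses need no node of their own; nesting is captured by the tree.
type Exp interface{}

type Num struct{ Val int }
type Plus struct{ Left, Right Exp }
type Mult struct{ Left, Right Exp }

// eval walks the AST bottom-up and computes the value of the expression.
func eval(e Exp) int {
	switch t := e.(type) {
	case Num:
		return t.Val
	case Plus:
		return eval(t.Left) + eval(t.Right)
	case Mult:
		return eval(t.Left) * eval(t.Right)
	}
	panic("unknown AST node")
}

func main() {
	// AST for "1 + 2 * 1", i.e. 1 + (2 * 1) under the usual precedence.
	e := Plus{Num{1}, Mult{Num{2}, Num{1}}}
	fmt.Println(eval(e)) // 3
}
```

Note that the precedence decision (interpretation (1) vs (2)) is made once, when the tree is built; eval never needs to look at parentheses or operator priorities again.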
Bottom-up parsing: see the Cocke–Younger–Kasami algorithm. Cubic complexity.
Top-down parsing: deterministic, runs in linear time. Generally, it requires rewriting the grammar into a specific form. This is not always possible. So, not every grammar is suitable for top-down parsing.
We show a top-down matcher for our expression language. There is some magic here which cannot be explained within the given time frame. Instead of a parser (which produces an AST), we only consider a matcher (which answers yes/no: is the input syntactically correct?).
package main
import "fmt"
// Simple scanner/lexer
// Tokens
const (
EOS = 0 // End of string
ZERO = 1
ONE = 2
TWO = 3
OPEN = 4
CLOSE = 5
PLUS = 6
MULT = 7
)
func scan(s string) (string, int) {
for {
switch {
case len(s) == 0:
return s, EOS
case s[0] == '0':
return s[1:len(s)], ZERO
case s[0] == '1':
return s[1:len(s)], ONE
case s[0] == '2':
return s[1:len(s)], TWO
case s[0] == '+':
return s[1:len(s)], PLUS
case s[0] == '*':
return s[1:len(s)], MULT
case s[0] == '(':
return s[1:len(s)], OPEN
case s[0] == ')':
return s[1:len(s)], CLOSE
default: // simply skip everything else
s = s[1:len(s)]
}
}
}
type State struct {
s *string
tok int
}
func next(s *State) {
s2, tok := scan(*s.s)
s.s = &s2
s.tok = tok
}
// E ::= T E'
func matchE(s *State) bool {
if !matchT(s) {
return false
}
if !matchE2(s) {
return false
}
return true
}
// E' ::= + T E' |
func matchE2(s *State) bool {
if s.tok == PLUS {
next(s)
if !matchT(s) {
return false
}
if !matchE2(s) {
return false
}
}
return true
}
// T ::= F T'
func matchT(s *State) bool {
if !matchF(s) {
return false
}
if !matchT2(s) {
return false
}
return true
}
// T' ::= * F T' |
func matchT2(s *State) bool {
if s.tok == MULT {
next(s)
if !matchF(s) {
return false
}
if !matchT2(s) {
return false
}
}
return true
}
// F ::= N | (E)
func matchF(s *State) bool {
switch {
case s.tok == ZERO:
next(s)
return true
case s.tok == ONE:
next(s)
return true
case s.tok == TWO:
next(s)
return true
case s.tok == OPEN:
next(s)
if !matchE(s) {
return false
}
if s.tok != CLOSE {
return false
}
next(s)
return true
}
return false
}
func match(s string) bool {
st := State{&s, EOS}
next(&st)
b := matchE(&st)
if st.tok == EOS {
return b
}
return false
}
func debug(s string) {
fmt.Printf("%s", s)
}
func testMatcherBad() {
fmt.Printf("\n BAD \n")
fmt.Printf("%t ", match(""))
fmt.Printf("%t ", match("1 + "))
fmt.Printf("%t ", match("1 + *"))
fmt.Printf("%t ", match("( 1 "))
fmt.Printf("%t ", match(" )"))
fmt.Printf("%t ", match("1 )"))
fmt.Printf("%t ", match("( 1 ) )"))
}
func testMatcherGood() {
fmt.Printf("\n GOOD \n")
fmt.Printf("%t ", match("1"))
fmt.Printf("%t ", match("1+0"))
fmt.Printf("%t ", match("1 * 2 "))
fmt.Printf("%t ", match(" (1) "))
fmt.Printf("%t ", match(" (1 * (2)) "))
fmt.Printf("%t ", match(" (1 + 2) * 0 "))
}
func main() {
testMatcherBad()
testMatcherGood()
fmt.Printf("\n")
}
A parser is a function which takes a string and yields a result and the remaining string.
type Parser func(s string) (interface{}, string, bool)
where interface{} represents the parsing result (e.g. some abstract syntax tree) and the returned string represents the remaining input. As parsing may fail, we also return a Boolean value to indicate success or failure.
Application of a parser on some input.
func parse(f Parser, s string) (interface{}, string, bool) {
return f(s)
}
The epsilon combinator which parses the empty string. We leave the input string untouched and report as a result the empty string.
func eps() Parser {
return func(s string) (interface{}, string, bool) {
return "", s, true
}
}
A parser to accept a specific character (item).
func item(x byte) Parser {
return func(s string) (interface{}, string, bool) {
if len(s) == 0 {
return 0, s, false
} else if s[0] == x {
return s[0], s[1:len(s)], true
} else {
return 0, s, false
}
}
}
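To see the two primitives in isolation, here is a small self-contained sketch (repeating the definitions above): eps succeeds without consuming input, item consumes exactly one matching character, and a failing parser leaves the input untouched.

```go
package main

import "fmt"

type Parser func(s string) (interface{}, string, bool)

func parse(f Parser, s string) (interface{}, string, bool) { return f(s) }

// eps parses the empty string: always succeeds, consumes nothing.
func eps() Parser {
	return func(s string) (interface{}, string, bool) { return "", s, true }
}

// item(x) accepts exactly the character x.
func item(x byte) Parser {
	return func(s string) (interface{}, string, bool) {
		if len(s) == 0 || s[0] != x {
			return 0, s, false
		}
		return s[0], s[1:], true
	}
}

func main() {
	// eps succeeds and leaves the input untouched.
	_, rest, ok := parse(eps(), "abc")
	fmt.Println(rest, ok) // abc true

	// item('a') consumes the leading 'a'.
	_, rest, ok = parse(item('a'), "abc")
	fmt.Println(rest, ok) // bc true

	// item('x') fails; the input is returned unchanged.
	_, rest, ok = parse(item('x'), "abc")
	fmt.Println(rest, ok) // abc false
}
```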
Build new parsers by composition of existing parsers.
func (p1 Parser) alt(p2 Parser) Parser {
return func(s string) (interface{}, string, bool) {
v, rest, o := parse(p1, s)
if !o {
return parse(p2, s)
} else {
return v, rest, o
}
}
}
func (p1 Parser) seq(p2 func(interface{}) Parser) Parser {
return func(s string) (interface{}, string, bool) {
v, rest, o := parse(p1, s)
if !o {
return v, rest, o
} else {
return parse(p2(v), rest)
}
}
}
func (p1 Parser) conc(p2 Parser) Parser {
return p1.seq(func(v interface{}) Parser { return p2 })
}
alt is left-biased: if the first argument (parser) fails, we try the alternative.
seq runs two parsers in sequence. Recall that as a side product of parsing we usually expect to obtain a parse tree. With parser combinators, the result obtained can be arbitrary. Check the type of seq!
conc is a specialized combinator where we simply concatenate the two parsers, ignoring the result obtained from the first parser.
As we largely ignore here the result produced by parsing, we write a matcher function which checks if our parser matches some input string.
func matcher(p Parser, s string) bool {
_, rest, o := parse(p, s)
return rest == "" && o
}
ex0 := item('c')
ex1 := item('a').alt(item('b'))
ex2 := ex0.conc(ex1)
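The full power of seq shows when the second parser inspects the value produced by the first. As a small example of our own (not part of the lecture's test programs), the following self-contained sketch builds a parser that accepts exactly "aa" or "bb": whichever character the first parser consumed must appear again.

```go
package main

import "fmt"

type Parser func(s string) (interface{}, string, bool)

func parse(f Parser, s string) (interface{}, string, bool) { return f(s) }

func item(x byte) Parser {
	return func(s string) (interface{}, string, bool) {
		if len(s) == 0 || s[0] != x {
			return 0, s, false
		}
		return s[0], s[1:], true
	}
}

func (p1 Parser) alt(p2 Parser) Parser {
	return func(s string) (interface{}, string, bool) {
		v, rest, o := parse(p1, s)
		if !o {
			return parse(p2, s)
		}
		return v, rest, o
	}
}

func (p1 Parser) seq(p2 func(interface{}) Parser) Parser {
	return func(s string) (interface{}, string, bool) {
		v, rest, o := parse(p1, s)
		if !o {
			return v, rest, o
		}
		return parse(p2(v), rest)
	}
}

func matcher(p Parser, s string) bool {
	_, rest, o := parse(p, s)
	return rest == "" && o
}

func main() {
	// double matches "aa" or "bb": the second parser is built from the
	// byte the first parser consumed.
	double := item('a').alt(item('b')).seq(func(v interface{}) Parser {
		return item(v.(byte))
	})
	fmt.Println(matcher(double, "aa")) // true
	fmt.Println(matcher(double, "bb")) // true
	fmt.Println(matcher(double, "ab")) // false
}
```

This context-sensitive dependency is exactly what conc throws away, which is why conc suffices for plain grammar concatenation but seq is the more general building block.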
Most parser combinator libraries provide further combinators, so we can even deal with left-recursive grammars, ambiguous grammars, and grammars beyond context-free. For details see https://en.wikipedia.org/wiki/Parser_combinator.
In fact, some of these more expressive combinators can be directly encoded in terms of the host language. In the following, we show how to encode Kleene star.
Consider the regular expression
a* c
which can also be defined in terms of the following CFG productions
A -> a A | c
Observation:
Our host language Go supports recursive functions.
Encode the Kleene star via a recursive function yielding a parser for the above grammar.
func acceptAsC() Parser {
return item('c').alt(item('a').seq(func(v interface{}) Parser { return acceptAsC() }))
}
Some points to note.
Symbol c tells us to stop and hence is tried first. Recall that alt is left-biased.
We can NOT write (the more direct) combinator program
func acceptAsC2() Parser {
return item('c').alt(item('a').conc(acceptAsC2()))
}
Go is a strictly evaluated language and the above would immediately lead to a stack overflow. Hence, we need to 'hide' the recursive call within a (lambda) function.
Aside. In the lazy language Haskell, the more direct encoding is possible because program parts are lazily evaluated (i.e. only when needed).
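The same trick of hiding the recursive call inside a lambda generalizes to a reusable Kleene-star combinator. The star function below is our own addition, not part of the lecture's library; note that this naive encoding is greedy and does not backtrack to fewer repetitions.

```go
package main

import "fmt"

type Parser func(s string) (interface{}, string, bool)

func parse(f Parser, s string) (interface{}, string, bool) { return f(s) }

func eps() Parser {
	return func(s string) (interface{}, string, bool) { return "", s, true }
}

func item(x byte) Parser {
	return func(s string) (interface{}, string, bool) {
		if len(s) == 0 || s[0] != x {
			return 0, s, false
		}
		return s[0], s[1:], true
	}
}

func (p1 Parser) alt(p2 Parser) Parser {
	return func(s string) (interface{}, string, bool) {
		v, rest, o := parse(p1, s)
		if !o {
			return parse(p2, s)
		}
		return v, rest, o
	}
}

func (p1 Parser) seq(p2 func(interface{}) Parser) Parser {
	return func(s string) (interface{}, string, bool) {
		v, rest, o := parse(p1, s)
		if !o {
			return v, rest, o
		}
		return parse(p2(v), rest)
	}
}

func (p1 Parser) conc(p2 Parser) Parser {
	return p1.seq(func(v interface{}) Parser { return p2 })
}

func matcher(p Parser, s string) bool {
	_, rest, o := parse(p, s)
	return rest == "" && o
}

// star(p) encodes p* via the grammar  S -> p S | epsilon.
// The recursive call star(p) sits inside the function passed to seq,
// so it is only evaluated once p has matched; writing it with conc
// would recurse immediately and overflow the stack.
func star(p Parser) Parser {
	return p.seq(func(v interface{}) Parser { return star(p) }).alt(eps())
}

func main() {
	asC := star(item('a')).conc(item('c')) // the regular expression a* c
	fmt.Println(matcher(asC, "c"))    // true
	fmt.Println(matcher(asC, "aaac")) // true
	fmt.Println(matcher(asC, "aab"))  // false
}
```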
Parser combinators form a domain-specific language embedded into some host language (here Go).
Our host language Go is strongly typed, so we can statically guarantee that our parsers are "well-formed" (a parser is a composition of existing parsers).
We make use of recursion available in the host language to extend the expressiveness of our parser combinator language (see encoding of Kleene star).
DSL (external):
M ::= 'c' | 'a' M
EDSL (embedded, in Go):
func acceptAsC2() Parser { return item('c').alt(item('a').conc(acceptAsC2())) }
package main
// Parser combinators https://en.wikipedia.org/wiki/Parser_combinator
import "fmt"
type Parser func(s string) (interface{}, string, bool)
func parse(f Parser, s string) (interface{}, string, bool) {
return f(s)
}
func eps() Parser {
return func(s string) (interface{}, string, bool) {
return "", s, true
}
}
func item(x byte) Parser {
return func(s string) (interface{}, string, bool) {
if len(s) == 0 {
return 0, s, false
} else if s[0] == x {
return s[0], s[1:len(s)], true
} else {
return 0, s, false
}
}
}
func (p1 Parser) alt(p2 Parser) Parser {
return func(s string) (interface{}, string, bool) {
v, rest, o := parse(p1, s)
if !o {
return parse(p2, s)
} else {
return v, rest, o
}
}
}
func (p1 Parser) seq(p2 func(interface{}) Parser) Parser {
return func(s string) (interface{}, string, bool) {
v, rest, o := parse(p1, s)
if !o {
return v, rest, o
} else {
return parse(p2(v), rest)
}
}
}
func (p1 Parser) conc(p2 Parser) Parser {
return p1.seq(func(v interface{}) Parser { return p2 })
}
func matcher(p Parser, s string) bool {
_, rest, o := parse(p, s)
return rest == "" && o
}
func acceptAsC() Parser {
return item('c').alt(item('a').seq(func(v interface{}) Parser { return acceptAsC() }))
}
func acceptAsC2() Parser {
return item('c').alt(item('a').conc(acceptAsC2()))
}
func main() {
fmt.Printf("%t ", matcher(acceptAsC(), "ac"))
// fmt.Printf("%t ", matcher(acceptAsC2(), "ac")) // stack overflow, see above
ex0 := item('c')
fmt.Printf("%t ", matcher(ex0, "c"))
fmt.Printf("%t ", matcher(ex0, "cd"))
ex1 := item('a').alt(item('b'))
fmt.Printf("%t ", matcher(ex1, "a"))
fmt.Printf("%t ", matcher(ex1, "b"))
fmt.Printf("%t ", matcher(ex1, "c"))
fmt.Printf("%t ", matcher(ex1, "ab"))
ex2 := ex0.conc(ex1)
fmt.Printf("%t ", matcher(ex2, "cb"))
fmt.Printf("%t ", matcher(ex2, "ad"))
}