Martin Sulzmann
Highlights:
Playing with higher-order functions.
Example of an embedded (internal) domain-specific language (DSL).
DSL separated from host language via types.
GPL = general-purpose language (C, Java, Kotlin, Haskell, Go, Python, ...)
DSL = domain-specific language (SQL, Latex, bash, perl, ...)
DSLs are usually external languages. Come with their own syntax, interpreter, compiler, ... ("eco-system").
Internal DSLs (a.k.a. embedded DSLs, a.k.a. APIs or frameworks) make use of the host language (GPL) to provide functionality similar to an external DSL. Often easier to evolve and adapt, and can reuse the existing eco-system. Might not be as optimized as an external DSL. Other issues: domain-specific error handling etc.
There are lots of different parsing approaches.
For example, see LL and LR parsing.
Another popular approach is parser combinators, see https://en.wikipedia.org/wiki/Parser_combinator.
Main idea:
Use combinators to build parsers.
Combinators are fancy API functions which hide much of the plumbing necessary to carry out parsing.
Often, an EBNF specification can be directly translated into some sequence of combinator calls.
There exist parser combinator libraries for most programming languages.
Compared to parsing tools such as yacc and ANTLR, the parser is embedded into the host language. So, parser combinators form an internal domain-specific language (DSL) (aka embedded DSL) whereas yacc and ANTLR are external DSLs.
In the following, we give a (rather naive) encoding of a parser combinator library in Go.
Compiler pipeline:
Syntax Analysis ---> Semantic Analysis ---> Optimization ---> Codegen
Ex1: x + * z. Syntax error.
Ex2: x + true. Semantic error: can't mix Int and Bool.
Ex3: f(1) + f(1), where f is of type Int -> Int. All good; an optimizer may rewrite this to let x = f(1) in x + x, and codegen targets Native, JVM, ...
Our interest: Syntax Analysis.
Check if the program is syntactically valid.
Transform the program (=string) into a more suitable representation for further analysis (AST).
How to describe that an arithmetic expression is syntactically well-formed?
Examples.
1, 2, ...
1 + 2
(1 + 3) * 2
...
Let's be more formal.
E is an expression if and only if (iff)
E is a number
E is of the form E1+E2 or E1*E2, where E1 and E2 are expressions.
E is of the form (E1) where E1 is an expression.
Observation. We can use (context-free grammar) rules to describe expressions.
E -> 0
E -> 1
...
...
E -> E + E
E -> E * E
E -> (E)
The rules for numbers are tedious.
E -> N
N -> 0
...
N -> 9
N -> 0N
...
N -> 9N
Could be much simplified if we use regular expressions.
E -> ['0'-'9']+
where we make use of the following short-hands:
r+ is a short-form for r r*
['0'-'9'] is a short-form for '0'|...|'9'
There are a number of further short-hands and notations to make context-free grammars more accessible. See Extended Backus–Naur form.
We loosely follow EBNF. For brevity, we only consider the numbers 0, 1 and 2.
N ::= 0 | 1 | 2
E ::= N | (E) | E + E | E * E
We refer to N and E as non-terminal symbols and 0, 1, 2, (, ), + and * as terminal symbols.
Anything you notice? The EBNF grammar is ambiguous. Ambiguous means here that for the same word, we find two derivations using the above rules.
(1)
E
-> E + E
-> N + E
-> 1 + E
-> 1 + E * E
-> 1 + N * E
-> 1 + 2 * E
-> 1 + 2 * N
-> 1 + 2 * 1
(2)
E
-> E * E
-> E + E * E
-> 1 + E * E
-> 1 + 2 * E
-> 1 + 2 * 1
What is the consequence of ambiguity? In case of (1), we interpret "1 + 2 * 1" as 1 + (2 * 1). In case of (2), we interpret "1 + 2 * 1" as (1 + 2) * 1.
Unambiguity of a grammar is therefore important. Otherwise, compiler A (Oracle) might follow interpretation (1) and compiler B (Microsoft) interpretation (2). The same program will be interpreted differently!
How to resolve the ambiguity for our example? In general, we assume that * binds tighter than +. This can be enforced via the following grammar.
E ::= E + T | T
T ::= T * F | F
F ::= N | (E)
Recall the above example.
E
-> E + T
-> E + T * F
-> T + T * F
-> F + T * F
-> F + F * F
-> N + F * F
-> 1 + F * F
-> 1 + N * F
-> 1 + 2 * F
-> 1 + 2 * N
-> 1 + 2 * 1
Interpretation is 1 + (2 * 1).
A derivation can be represented as a parse tree.
Idea:
Each rule consists of a left-hand side (non-terminal) and a right-hand side (mix of terminals and non-terminals).
We view each non-terminal as an intermediate node in a tree. Each terminal is a leaf node.
The left-hand side non-terminal is therefore the parent node. The children are found on the right-hand side.
Consider
E ::= E + T
viewed as a tree
  E
/ | \
E + T
Consider the derivation.
E
-> T
-> F
-> (E)
-> (E + T)
-> (T + T)
-> (F + T)
-> (F + F)
-> (N + F)
-> (N + N)
-> (1 + 0)
Represented as a parse tree.
              E
              |
              T
              |
              F
           /  |  \
          (   E   )
            / | \
           E  +  T
           |     |
           T     F
           |     |
           F     N
           |     |
           N     0
           |
           1
The above parse tree represents the concrete syntax where all details of the original expression remain visible. We refer to the above as a concrete parse tree.
In further transformation steps, not all (concrete) syntactic details are necessary. For example, intermediate steps like "E reduced to T" are not relevant. Furthermore, we can ignore parentheses as they are (implicitly) represented by the structure of the tree.
The thus simplified tree is referred to as the abstract parse tree (aka abstract syntax tree = AST) and captures the abstract syntax of expressions.
+
/ \
1 0
In the AST representation, we furthermore turn terminal symbols such as + and * into intermediate (binary) nodes.
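To make the abstract syntax concrete, such an AST could be sketched in Go as follows. The type and constructor names (Exp, Num, Plus, Mult) and the eval function are our own choice for illustration; they are not part of the matcher code below.

```go
package main

import "fmt"

// Hypothetical AST for the expression language E ::= N | (E) | E + E | E * E.
// Parentheses need no node of their own; nesting is captured by the tree.
type Exp interface{}

type Num struct{ Val int }
type Plus struct{ Left, Right Exp }
type Mult struct{ Left, Right Exp }

// eval walks the AST bottom-up and computes the value of the expression.
func eval(e Exp) int {
	switch t := e.(type) {
	case Num:
		return t.Val
	case Plus:
		return eval(t.Left) + eval(t.Right)
	case Mult:
		return eval(t.Left) * eval(t.Right)
	}
	panic("unknown AST node")
}

func main() {
	// AST for "1 + 2 * 1", i.e. 1 + (2 * 1) under the usual precedence.
	e := Plus{Num{1}, Mult{Num{2}, Num{1}}}
	fmt.Println(eval(e)) // 3
}
```

Note that the precedence decision (interpretation (1) vs (2)) is made once, when the tree is built; eval never needs to look at parentheses or operator priorities again.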
Bottom-up parsing: see the Cocke–Younger–Kasami algorithm. Cubic complexity.
Top-down parsing: deterministic, runs in linear time. Generally, it requires rewriting the grammar into a specific form. This is not always possible. So, not every grammar is suitable for top-down parsing.
We show a top-down matcher for our expression language. There is some magic here which cannot be explained within the given time frame. Instead of a parser (which produces an AST), we only consider a matcher (which answers yes/no: is the input syntactically correct?).
package main
import "fmt"
// Simple scanner/lexer
// Tokens
const (
EOS = 0 // End of string
ZERO = 1
ONE = 2
TWO = 3
OPEN = 4
CLOSE = 5
PLUS = 6
MULT = 7
)
func scan(s string) (string, int) {
for {
switch {
case len(s) == 0:
return s, EOS
case s[0] == '0':
return s[1:len(s)], ZERO
case s[0] == '1':
return s[1:len(s)], ONE
case s[0] == '2':
return s[1:len(s)], TWO
case s[0] == '+':
return s[1:len(s)], PLUS
case s[0] == '*':
return s[1:len(s)], MULT
case s[0] == '(':
return s[1:len(s)], OPEN
case s[0] == ')':
return s[1:len(s)], CLOSE
default: // simply skip everything else
s = s[1:len(s)]
}
}
}
type State struct {
s *string
tok int
}
func next(s *State) {
s2, tok := scan(*s.s)
s.s = &s2
s.tok = tok
}
// E ::= T E'
func matchE(s *State) bool {
if !matchT(s) {
return false
}
if !matchE2(s) {
return false
}
return true
}
// E' ::= + T E' |
func matchE2(s *State) bool {
if s.tok == PLUS {
next(s)
if !matchT(s) {
return false
}
if !matchE2(s) {
return false
}
}
return true
}
// T ::= F T'
func matchT(s *State) bool {
if !matchF(s) {
return false
}
if !matchT2(s) {
return false
}
return true
}
// T' ::= * F T' |
func matchT2(s *State) bool {
if s.tok == MULT {
next(s)
if !matchF(s) {
return false
}
if !matchT2(s) {
return false
}
}
return true
}
// F ::= N | (E)
func matchF(s *State) bool {
switch {
case s.tok == ZERO:
next(s)
return true
case s.tok == ONE:
next(s)
return true
case s.tok == TWO:
next(s)
return true
case s.tok == OPEN:
next(s)
if !matchE(s) {
return false
}
if s.tok != CLOSE {
return false
}
next(s)
return true
}
return false
}
func match(s string) bool {
st := State{&s, EOS}
next(&st)
b := matchE(&st)
if st.tok == EOS {
return b
}
return false
}
func debug(s string) {
fmt.Printf("%s", s)
}
func testMatcherBad() {
fmt.Printf("\n BAD \n")
fmt.Printf("%t ", match(""))
fmt.Printf("%t ", match("1 + "))
fmt.Printf("%t ", match("1 + *"))
fmt.Printf("%t ", match("( 1 "))
fmt.Printf("%t ", match(" )"))
fmt.Printf("%t ", match("1 )"))
fmt.Printf("%t ", match("( 1 ) )"))
}
func testMatcherGood() {
fmt.Printf("\n GOOD \n")
fmt.Printf("%t ", match("1"))
fmt.Printf("%t ", match("1+0"))
fmt.Printf("%t ", match("1 * 2 "))
fmt.Printf("%t ", match(" (1) "))
fmt.Printf("%t ", match(" (1 * (2)) "))
fmt.Printf("%t ", match(" (1 + 2) * 0 "))
}
func main() {
testMatcherBad()
testMatcherGood()
fmt.Printf("\n")
}
A parser is a function which takes a string and yields a result and the remaining string.
type Parser func(s string) (interface{}, string, bool)
where interface{} represents the parsing result (e.g. some abstract syntax tree) and the returned string represents the remaining input. As parsing may fail, we also return a Boolean value to indicate success or failure.
Application of a parser on some input.
func parse(f Parser, s string) (interface{}, string, bool) {
return f(s)
}
The epsilon combinator which parses the empty string. We leave the input string untouched and report as a result the empty string.
func eps() Parser {
return func(s string) (interface{}, string, bool) {
return "", s, true
}
}
A parser to accept a specific character (item).
func item(x byte) Parser {
return func(s string) (interface{}, string, bool) {
if len(s) == 0 {
return 0, s, false
} else if s[0] == x {
return s[0], s[1:len(s)], true
} else {
return 0, s, false
}
}
}
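To see the two primitives in isolation, here is a small self-contained sketch (repeating the definitions above): eps succeeds without consuming input, item consumes exactly one matching character, and a failing parser leaves the input untouched.

```go
package main

import "fmt"

type Parser func(s string) (interface{}, string, bool)

func parse(f Parser, s string) (interface{}, string, bool) { return f(s) }

// eps parses the empty string: always succeeds, consumes nothing.
func eps() Parser {
	return func(s string) (interface{}, string, bool) { return "", s, true }
}

// item(x) accepts exactly the character x.
func item(x byte) Parser {
	return func(s string) (interface{}, string, bool) {
		if len(s) == 0 || s[0] != x {
			return 0, s, false
		}
		return s[0], s[1:], true
	}
}

func main() {
	// eps succeeds and leaves the input untouched.
	_, rest, ok := parse(eps(), "abc")
	fmt.Println(rest, ok) // abc true

	// item('a') consumes the leading 'a'.
	_, rest, ok = parse(item('a'), "abc")
	fmt.Println(rest, ok) // bc true

	// item('x') fails; the input is returned unchanged.
	_, rest, ok = parse(item('x'), "abc")
	fmt.Println(rest, ok) // abc false
}
```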
Build new parsers by composition of existing parsers.
func (p1 Parser) alt(p2 Parser) Parser {
return func(s string) (interface{}, string, bool) {
v, rest, o := parse(p1, s)
if !o {
return parse(p2, s)
} else {
return v, rest, o
}
}
}
func (p1 Parser) seq(p2 func(interface{}) Parser) Parser {
return func(s string) (interface{}, string, bool) {
v, rest, o := parse(p1, s)
if !o {
return v, rest, o
} else {
return parse(p2(v), rest)
}
}
}
func (p1 Parser) conc(p2 Parser) Parser {
return p1.seq(func(v interface{}) Parser { return p2 })
}
alt is left-biased: if the first argument (parser) fails, we try the alternative.
seq runs two parsers in sequence. Recall that as a side product of parsing we usually expect to obtain a parse tree. With parser combinators, the result obtained can be arbitrary. Check the type of seq!
conc is a specialized combinator where we simply concatenate the two parsers, ignoring the result obtained from the first parser.
As we largely ignore here the result produced by parsing, we write a matcher function which checks if our parser matches some input string.
func matcher(p Parser, s string) bool {
_, rest, o := parse(p, s)
return rest == "" && o
}
ex0 := item('c')
ex1 := item('a').alt(item('b'))
ex2 := ex0.conc(ex1)
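The full power of seq shows when the second parser inspects the value produced by the first. As a small example of our own (not part of the lecture's test programs), the following self-contained sketch builds a parser that accepts exactly "aa" or "bb": whichever character the first parser consumed must appear again.

```go
package main

import "fmt"

type Parser func(s string) (interface{}, string, bool)

func parse(f Parser, s string) (interface{}, string, bool) { return f(s) }

func item(x byte) Parser {
	return func(s string) (interface{}, string, bool) {
		if len(s) == 0 || s[0] != x {
			return 0, s, false
		}
		return s[0], s[1:], true
	}
}

func (p1 Parser) alt(p2 Parser) Parser {
	return func(s string) (interface{}, string, bool) {
		v, rest, o := parse(p1, s)
		if !o {
			return parse(p2, s)
		}
		return v, rest, o
	}
}

func (p1 Parser) seq(p2 func(interface{}) Parser) Parser {
	return func(s string) (interface{}, string, bool) {
		v, rest, o := parse(p1, s)
		if !o {
			return v, rest, o
		}
		return parse(p2(v), rest)
	}
}

func matcher(p Parser, s string) bool {
	_, rest, o := parse(p, s)
	return rest == "" && o
}

func main() {
	// double matches "aa" or "bb": the second parser is built from the
	// byte the first parser consumed.
	double := item('a').alt(item('b')).seq(func(v interface{}) Parser {
		return item(v.(byte))
	})
	fmt.Println(matcher(double, "aa")) // true
	fmt.Println(matcher(double, "bb")) // true
	fmt.Println(matcher(double, "ab")) // false
}
```

This context-sensitive dependency is exactly what conc throws away, which is why conc suffices for plain grammar concatenation but seq is the more general building block.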
Most parser combinator libraries provide further combinators, so we can even deal with left-recursive grammars, ambiguous grammars, and grammars beyond context-free. For details see https://en.wikipedia.org/wiki/Parser_combinator.
In fact, some of these more expressive combinators can be directly encoded in terms of the host language. In the following, we show how to encode Kleene star.
Consider the regular expression
a* c
which can also be defined in terms of the following CFG productions
A -> a A | c
Observation:
Our host language Go supports recursive functions.
Encode the Kleene star via a recursive function yielding a parser for the above grammar.
func acceptAsC() Parser {
return item('c').alt(item('a').seq(func(v interface{}) Parser { return acceptAsC() }))
}
Some points to note.
Symbol c tells us to stop and hence is tried first. Recall that alt is left-biased.
We can NOT write (the more direct) combinator program
func acceptAsC2() Parser {
return item('c').alt(item('a').conc(acceptAsC2()))
}
Go is a strictly evaluated language and the above would immediately lead to a stack overflow. Hence, we need to 'hide' the recursive call within a (lambda) function.
Aside. In the lazy language Haskell, the more direct encoding is possible because program parts are lazily evaluated (i.e. only when needed).
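The same trick of hiding the recursive call inside a lambda generalizes to a reusable Kleene-star combinator. The star function below is our own addition, not part of the lecture's library; note that this naive encoding is greedy and does not backtrack to fewer repetitions.

```go
package main

import "fmt"

type Parser func(s string) (interface{}, string, bool)

func parse(f Parser, s string) (interface{}, string, bool) { return f(s) }

func eps() Parser {
	return func(s string) (interface{}, string, bool) { return "", s, true }
}

func item(x byte) Parser {
	return func(s string) (interface{}, string, bool) {
		if len(s) == 0 || s[0] != x {
			return 0, s, false
		}
		return s[0], s[1:], true
	}
}

func (p1 Parser) alt(p2 Parser) Parser {
	return func(s string) (interface{}, string, bool) {
		v, rest, o := parse(p1, s)
		if !o {
			return parse(p2, s)
		}
		return v, rest, o
	}
}

func (p1 Parser) seq(p2 func(interface{}) Parser) Parser {
	return func(s string) (interface{}, string, bool) {
		v, rest, o := parse(p1, s)
		if !o {
			return v, rest, o
		}
		return parse(p2(v), rest)
	}
}

func (p1 Parser) conc(p2 Parser) Parser {
	return p1.seq(func(v interface{}) Parser { return p2 })
}

func matcher(p Parser, s string) bool {
	_, rest, o := parse(p, s)
	return rest == "" && o
}

// star(p) encodes p* via the grammar  S -> p S | epsilon.
// The recursive call star(p) sits inside the function passed to seq,
// so it is only evaluated once p has matched; writing it with conc
// would recurse immediately and overflow the stack.
func star(p Parser) Parser {
	return p.seq(func(v interface{}) Parser { return star(p) }).alt(eps())
}

func main() {
	asC := star(item('a')).conc(item('c')) // the regular expression a* c
	fmt.Println(matcher(asC, "c"))    // true
	fmt.Println(matcher(asC, "aaac")) // true
	fmt.Println(matcher(asC, "aab"))  // false
}
```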
Parser combinators form a domain-specific language embedded into some host language (here Go).
Our host language Go is strongly typed, so we can statically guarantee that our parsers are "well-formed" (a parser is a composition of existing parsers).
We make use of recursion available in the host language to extend the expressiveness of our parser combinator language (see encoding of Kleene star).
DSL (external):
M ::= 'c' | 'a' M
EDSL (embedded, in Go):
func acceptAsC2() Parser { return item('c').alt(item('a').conc(acceptAsC2())) }
package main
// Parser combinators https://en.wikipedia.org/wiki/Parser_combinator
import "fmt"
type Parser func(s string) (interface{}, string, bool)
func parse(f Parser, s string) (interface{}, string, bool) {
return f(s)
}
func eps() Parser {
return func(s string) (interface{}, string, bool) {
return "", s, true
}
}
func item(x byte) Parser {
return func(s string) (interface{}, string, bool) {
if len(s) == 0 {
return 0, s, false
} else if s[0] == x {
return s[0], s[1:len(s)], true
} else {
return 0, s, false
}
}
}
func (p1 Parser) alt(p2 Parser) Parser {
return func(s string) (interface{}, string, bool) {
v, rest, o := parse(p1, s)
if !o {
return parse(p2, s)
} else {
return v, rest, o
}
}
}
func (p1 Parser) seq(p2 func(interface{}) Parser) Parser {
return func(s string) (interface{}, string, bool) {
v, rest, o := parse(p1, s)
if !o {
return v, rest, o
} else {
return parse(p2(v), rest)
}
}
}
func (p1 Parser) conc(p2 Parser) Parser {
return p1.seq(func(v interface{}) Parser { return p2 })
}
func matcher(p Parser, s string) bool {
_, rest, o := parse(p, s)
return rest == "" && o
}
func acceptAsC() Parser {
return item('c').alt(item('a').seq(func(v interface{}) Parser { return acceptAsC() }))
}
func acceptAsC2() Parser {
return item('c').alt(item('a').conc(acceptAsC2()))
}
func main() {
fmt.Printf("%t ", matcher(acceptAsC(), "ac"))
// fmt.Printf("%t ", matcher(acceptAsC2(), "ac")) // stack overflow, see above
ex0 := item('c')
fmt.Printf("%t ", matcher(ex0, "c"))
fmt.Printf("%t ", matcher(ex0, "cd"))
ex1 := item('a').alt(item('b'))
fmt.Printf("%t ", matcher(ex1, "a"))
fmt.Printf("%t ", matcher(ex1, "b"))
fmt.Printf("%t ", matcher(ex1, "c"))
fmt.Printf("%t ", matcher(ex1, "ab"))
ex2 := ex0.conc(ex1)
fmt.Printf("%t ", matcher(ex2, "cb"))
fmt.Printf("%t ", matcher(ex2, "ad"))
}