Extended Example: Parser Combinators

Martin Sulzmann

Parser combinators

Highlights:

DSL versus GPL

GPL = general-purpose language (C, Java, Kotlin, Haskell, Go, Python, ...)

DSL = domain-specific language (SQL, Latex, bash, perl, ...)

DSLs are usually external languages. Come with their own syntax, interpreter, compiler, ... ("eco-system").

Internal DSLs (a.k.a. embedded DSLs aka APIs aka Frameworks). Make use of the host language (GPL) to provide similar functionality like a external DSL. Often easier to evolve and adapt. Can reuse existing eco-system. Might to be not as optimized as external DSL. Other issues, domain-specific error handling etc.

The problem domain we consider

There are lots of different parsing approaches.

For example, see LL and LR parsing.

Another popular approach are https://en.wikipedia.org/wiki/Parser_combinator.

Main idea:

In the following, we give a (rather naive) encoding of a parser combinator library in Go.

Background: Syntax Analysis (parsing)

Compiler pipeline:

     Syntax Analysis         --->   Semantic Analysie   --->   Optimization    --> Codegen 

Ex1:    x + * z

       syntax error

Ex2:    x + true                   semantic error
                                  can't mix Int and Bool

Ex3:    f(1) + f(1)               all good assuming          let x = f(1)         Native, JVM, ...
                                  f of type Int -> Int       in x + x

Our interest: Syntax Anlysis.

Context-free grammars

How to describe that an arithmetic expression is syntactically well-formed?

Examples.

1, 2, ...

1 + 2

(1 + 3) * 2

...

Let's be more formal.

E is an expression if and only if (iff)

Observation. We can use (context-free grammar) rules to describe expressions.

E -> 0
E -> 1
...
...
E -> E + E
E -> E * E
E -> (E)

The rules for numbers are tedious.

E -> N
N -> 0
...
N -> 9
N -> 0N
...
N -> 9N

Could be much simplified if we use regular expressions.

E -> ['0'-'9']+

where we make use of the following short-hands.

r+ = r r*.

['0'-'9'] is a short-form for '0'|...|'9'.

There are a number of further short-hands and notations to make context-free grammars more accessible. See Extended Backus–Naur form.

Expression syntax

We loosely follow EBNF. For brevity, we only consider the numbers 0, 1 and 2.

 N ::= 0 | 1 | 2 

 E ::= N | (E) | E + E | E * E

We refer to N and E as non-terminal symbols and 0, 1, 2, (, ), + und * as terminal symbols.

Anything you notice? The EBNF grammar is ambiguous. Ambiguous means here that for the same word, we find two derivations using the above rules.

 (1) 
          E
       -> E + E
       -> N + E
       -> 1 + E
       -> 1 + E * E
       -> 1 + N * E
       -> 1 + 2 * E
       -> 1 + 2 * N
       -> 1 + 2 * 1
 (2)

          E
       -> E * E
       -> E + E * E
       -> 1 + E * E
       -> 1 + 2 * E
       -> 1 + 2 * 1

What is the consequence of ambiguity? In case of (1), we interpret "1 + 2 * 1" as 1 + (2 * 1). In case of (2), we interpret "1 + 2 * 1" as (1 + 2) * 1.

Unambiguity of a grammar is therefore important. Otherwise, compiler A (Oracle) might follow the interpretation (1) and compiler B (Microsoft) follows the interpretation (2). The same programm will be interpreted differently!

How to resolve the ambiguity for our example? In general, we assume that * binds tigher than +. This can be enforced via the following grammar.

E ::= E + T | T

T ::= T * F | F

F ::= N | (E)

Recall the above example.

          E
       -> E + T
       -> E + T * F
       -> T + T * F
       -> F + T * F
       -> F + F * F
       -> N + F * F
       -> 1 + F * F
       -> 1 + N * F
       -> 1 + 2 * F
       -> 1 + 2 * N
       -> 1 + 2 * 1

Interpretation is 1 + (2 * 1).

Parse trees

A derivation can be represented as a parse tree.

Idea:

Each rule consists of a left-hand side (non-terminal) and a right-hand side (mix of terminals and non-terminals).

We view each non-terminal as an intermediate node in a tree. Each terminal is a leaf node.

The left-hand side terminal is therefore the parent node. The childern are found on the right-hand side.

Consider

E ::= E + T

viewed as a tree

        E
      / | \
     E  +  T    

Example

Consider the derivation.

     E
  -> T
  -> F
  -> (E)
  -> (E + T)
  -> (T + T)
  -> (F + T)
  -> (F + F)
  -> (N + F)
  -> (N + N)
  -> (1 + 0)

Represented as a parse tree.

         E
         |
         T
         |
         F
       / | \
      (  E  )
       / | \
      E  +  T
      |     |
      T     F
      |     |
      F     N
      |     |
      N     0
      |
      1  

Concrete versus abstract parse trees

The above parse tree represents the concrete syntax where all details of the original expression remain visible. We refer to the above as a concrete parse tree.

In further transformation steps, not all (concrete) syntactic details are necessary. For example, intermediate steps like "E reduced to T" are not relevant. Furthermore, we can ignore parantheses as they are (implictely) represented by the structure of the tree.

The thus simplified tree is referred to as the abstract parse tree (aka abstract syntax tree = AST) and captures the abstract syntax of expressions.

         +
        / \
       1   0     

In the AST representation we furthermore make use of binary termainal symbols in intermediate nodes.

General Parser

Siehe Cocke–Younger–Kasami algorithm.

Cubic complexity.

Top-Down Parsing

Deterministic in linear time.

Generally, requires to rewrite the grammar to a specific form. This is not always possible. So, not every grammar is suitable for top-down parsing.

Top-down matcher for expresssions

We show a top-down matcher for our expression language. Some magic here which can not be explained within the given time frame. Instead of a parser (that produces an AST), we also only consider a matcher (that says yes/no if the expression is syntactically correct or not).

package main

import "fmt"

// Simple scanner/lexer

// Tokens
const (
    EOS   = 0 // End of string
    ZERO  = 1
    ONE   = 2
    TWO   = 3
    OPEN  = 4
    CLOSE = 5
    PLUS  = 6
    MULT  = 7
)

func scan(s string) (string, int) {
    for {
        switch {
        case len(s) == 0:
            return s, EOS
        case s[0] == '0':
            return s[1:len(s)], ZERO
        case s[0] == '1':
            return s[1:len(s)], ONE
        case s[0] == '2':
            return s[1:len(s)], TWO
        case s[0] == '+':
            return s[1:len(s)], PLUS
        case s[0] == '*':
            return s[1:len(s)], MULT
        case s[0] == '(':
            return s[1:len(s)], OPEN
        case s[0] == ')':
            return s[1:len(s)], CLOSE
        default: // simply skip everything else
            s = s[1:len(s)]
        }

    }
}

type State struct {
    s   *string
    tok int
}

func next(s *State) {
    s2, tok := scan(*s.s)

    s.s = &s2
    s.tok = tok
}

// E  ::= T E'
func matchE(s *State) bool {
    if !matchT(s) {
        return false
    }
    if !matchE2(s) {
        return false
    }

    return true
}

// E' ::= + T E' |
func matchE2(s *State) bool {
    if s.tok == PLUS {
        next(s)
        if !matchT(s) {
            return false
        }
        if !matchE2(s) {
            return false
        }

    }

    return true
}

// T  ::= F T'
func matchT(s *State) bool {
    if !matchF(s) {
        return false
    }
    if !matchT2(s) {
        return false
    }
    return true
}

// T' ::= * F T' |
func matchT2(s *State) bool {
    if s.tok == MULT {
        next(s)
        if !matchF(s) {
            return false
        }
        if !matchT2(s) {
            return false
        }
    }
    return true
}

// F ::= N | (E)
func matchF(s *State) bool {
    switch {
    case s.tok == ZERO:
        next(s)
        return true
    case s.tok == ONE:
        next(s)
        return true
    case s.tok == TWO:
        next(s)
        return true
    case s.tok == OPEN:
        next(s)
        if !matchE(s) {
            return false
        }
        if s.tok != CLOSE {
            return false
        }
        next(s)
        return true
    }

    return false
}

func match(s string) bool {
    st := State{&s, EOS}
    next(&st)
    b := matchE(&st)
    if st.tok == EOS {
        return b
    }
    return false
}

func debug(s string) {
    fmt.Printf("%s", s)
}

func testMatcherBad() {
    fmt.Printf("\n BAD \n")
    fmt.Printf("%b", match(""))
    fmt.Printf("%b", match("1 + "))
    fmt.Printf("%b", match("1 + *"))
    fmt.Printf("%b", match("( 1 "))
    fmt.Printf("%b", match(" )"))
    fmt.Printf("%b", match("1 )"))
    fmt.Printf("%b", match("( 1 ) )"))
}

func testMatcherGood() {
    fmt.Printf("\n GOOD \n")
    fmt.Printf("%b", match("1"))
    fmt.Printf("%b", match("1+0"))
    fmt.Printf("%b", match("1 * 2 "))
    fmt.Printf("%b", match(" (1) "))
    fmt.Printf("%b", match(" (1 * (2)) "))
    fmt.Printf("%b", match(" (1 + 2) * 0 "))
}

func main() {

    testMatcherBad()

    testMatcherGood()
    fmt.Printf("\n")

}

What is a (functional) parser?

A parser is a function which takes a string and yields a result and the remaining string.

type Parser func(s string) (interface{}, string, bool)

where interface{} represents the parsing result (e.g. some abstract syntax tree) and the returning string represents the remaining input. As parsing may fail, we also return a Boolean value to represent either success or failure.

Application of a parser on some input.

func parse(f Parser, s string) (interface{}, string, bool) {
    return f(s)
}

Basic combinators

The epsilon combinator which parses the empty string. We leave the input string untouched and report as a result the empty string.

func eps() Parser {
    return func(s string) (interface{}, string, bool) {
        return "", s, true
    }
}

A parser to accept a specific character (item).

func item(x byte) Parser {
    return func(s string) (interface{}, string, bool) {
        if len(s) == 0 {
            return 0, s, false
        } else if s[0] == x {
            return s[0], s[1:len(s)], true
        } else {
            return 0, s, false
        }
    }
}

Building higher-order combinators

Build a new parser by composition of an exisisting parser.

func (p1 Parser) alt(p2 Parser) Parser {
    return func(s string) (interface{}, string, bool) {
        v, rest, o := parse(p1, s)
        if !o {
            return parse(p2, s)
        } else {
            return v, rest, o
        }

    }
}

func (p1 Parser) seq(p2 func(interface{}) Parser) Parser {
    return func(s string) (interface{}, string, bool) {
        v, rest, o := parse(p1, s)
        if !o {
            return v, rest, o
        } else {
            return parse(p2(v), rest)
        }

    }

}

func (p1 Parser) conc(p2 Parser) Parser {
    return p1.seq(func(v interface{}) Parser { return p2 })
}

alt is left-biased, if the first argument (parser) fails, we try the alternative.

seq runs two parsers in sequence. Recall that as a side product of parsing we usually expect to obtain a parse tree. With parser combinators the result obtained can be arbitrary. Check the type of seq!

conc is a specialized combinator where we simply concatenate the two parsers, ignoring the result obtained from the first parser.

Some simple examples

As we largely ignore here the result produced by parsing, we write a matcher function which checks if our parser matches some input string.

func matcher(p Parser, s string) bool {
    _, rest, o := parse(p, s)
    return rest == "" && o
}


ex0 := item('c')
ex1 := item('a').alt(item('b'))
ex2 := ex0.conc(ex1)

More expressive parser combinators

Most parser combinator libraries provide further combinators so we can even deal with left-recursive, ambiguous and even beyond context-free grammars. For details see here https://en.wikipedia.org/wiki/Parser_combinator.

In fact, some of these more expressive combinators can be directly encoded in terms of the host language. In the following, we show how to encode Kleene star.

Consider the regular expression

a* c

which can also be defined in terms of the following CFG productions

A -> a A | c

Observation:

func acceptAsC() Parser {
    return item('c').alt(item('a').seq(func(v interface{}) Parser { return acceptAsC() }))
}

Some points to note.

func acceptAsC2() Parser {
    return item('c').alt(item('a').conc(acceptAsC2()))
}

Go is a strictly evaluated language and the above would immediately lead to a stack overflow. Hence, we need to 'hide' the recursive call within a (lambda) function.

Aside. In the lazy language Haskell, the more direct encoding is possible because program parts are lazily evaluated (i.e. only when needed).

Summary

DSL (external):

M ::= 'c' | 'a' M

EDSL: {.go} func acceptAsC2() Parser { return item('c').alt(item('a').conc(acceptAsC2())) }~~~

Complete source code

package main

// Parser combinators https://en.wikipedia.org/wiki/Parser_combinator

import "fmt"

type Parser func(s string) (interface{}, string, bool)

func parse(f Parser, s string) (interface{}, string, bool) {
    return f(s)
}

func eps() Parser {
    return func(s string) (interface{}, string, bool) {
        return "", s, true
    }
}

func item(x byte) Parser {
    return func(s string) (interface{}, string, bool) {
        if len(s) == 0 {
            return 0, s, false
        } else if s[0] == x {
            return s[0], s[1:len(s)], true
        } else {
            return 0, s, false
        }
    }
}

func (p1 Parser) alt(p2 Parser) Parser {
    return func(s string) (interface{}, string, bool) {
        v, rest, o := parse(p1, s)
        if !o {
            return parse(p2, s)
        } else {
            return v, rest, o
        }

    }
}

func (p1 Parser) seq(p2 func(interface{}) Parser) Parser {
    return func(s string) (interface{}, string, bool) {
        v, rest, o := parse(p1, s)
        if !o {
            return v, rest, o
        } else {
            return parse(p2(v), rest)
        }

    }

}

func (p1 Parser) conc(p2 Parser) Parser {
    return p1.seq(func(v interface{}) Parser { return p2 })
}

func matcher(p Parser, s string) bool {
    _, rest, o := parse(p, s)
    return rest == "" && o
}

func acceptAsC() Parser {
    return item('c').alt(item('a').seq(func(v interface{}) Parser { return acceptAsC() }))
}

func acceptAsC2() Parser {
    return item('c').alt(item('a').conc(acceptAsC2()))
}



func main() {

      fmt.Printf("%b", matcher(acceptAsC(), "ac"))  
//    fmt.Printf("%b", matcher(acceptAsC2(), "ac"))
    

    ex0 := item('c')
    fmt.Printf("%b", matcher(ex0, "c"))
    fmt.Printf("%b", matcher(ex0, "cd"))

    ex1 := item('a').alt(item('b'))
    fmt.Printf("%b", matcher(ex1, "a"))
    fmt.Printf("%b", matcher(ex1, "b"))
    fmt.Printf("%b", matcher(ex1, "c"))
    fmt.Printf("%b", matcher(ex1, "ab"))

    ex2 := ex0.conc(ex1)
    fmt.Printf("%b", matcher(ex2, "cb"))
    fmt.Printf("%b", matcher(ex2, "ad"))        

}