Martin Sulzmann
Compiler pipeline:

Syntax Analysis ---> Semantic Analysis ---> Optimization ---> Codegen (Native, JVM, ...)

Ex1: x + * z is a syntax error.

Ex2: x + true is a semantic error: we can't mix Int and Bool.

Ex3: f(1) + f(1) is all good. Assuming f of type Int -> Int, the optimizer may rewrite this to let x = f(1) in x + x.
Matching versus parsing.
Matching: Check if the program is syntactically valid.
Parsing: Match and transform the program (=string) into a more suitable representation for further analysis (AST = Abstract Syntax Tree).
Parsing is largely a solved problem.
There exist many parsing tools (e.g. consider yacc and ANTLR).
The state of the art tool ANTLR is based on the idea of top-down parsing.
ANTLR requires the user to specify the syntax (to be parsed) in terms of a form of context-free grammars (CFGs).
From the CFG, ANTLR produces a parser (a Java program) that then parses the prescribed syntax. Hence, ANTLR is a parser generator.
We do not consider a specific parsing tool. Rather, we study the main idea behind top-down parsing.
Transform CFG rules into a suitable form. This may not always be possible.
Turn CFG rules in suitable form into (Go) parsing functions.
The last step from CFG rules to Go is done by hand. Tools such as yacc and ANTLR automate this step.
Use combinators to build a parser (parser combinators)
Combinators are fancy API (parsing) functions which hide much of the plumbing necessary to carry out parsing.
As in the case of a handwritten parser, we might need to bring the CFG rules into a suitable form so that we can apply parser combinators.
There exist lots of parser combinator libraries for most programming languages. We will implement our own parser combinator library in Go.
Compared to parsing tools such as yacc and ANTLR, the parser is embedded into the host language. So, parser combinators form an internal domain-specific language (DSL) (aka embedded DSL) whereas yacc and ANTLR are external DSLs.
Parser combinators are also a good example of
playing with higher-order functions,
an embedded (internal) domain-specific language (DSL), and
a DSL separated from the host language via types.
GPL = general-purpose language (C, Java, Kotlin, Haskell, Go, Python, ...)
DSL = domain-specific language (SQL, Latex, bash, perl, ...)
DSLs are usually external languages. They come with their own syntax, interpreter, compiler, ... (their own "eco-system").
Internal DSLs (a.k.a. embedded DSLs, a.k.a. APIs or frameworks) make use of the host language (GPL) to provide functionality similar to an external DSL. They are often easier to evolve and adapt and can reuse the existing eco-system. They might not be as optimized as an external DSL, and there are further issues such as domain-specific error handling.
How to describe and verify that a sequence of symbols consisting of digits, +, * etc. is a syntactically well-formed arithmetic expression?
Examples.
1, 2, ...
1 + 2
(1 + 3) * 2
...
Let's be more formal.
E is an expression if and only if (iff)
E is a number
E is either of the form E1 + E2 or E1 * E2 where E1 and E2 are expressions.
E is of the form (E1) where E1 is an expression.
Observation. We can use (context-free grammar) rules to describe expressions.
E -> 0
E -> 1
...
E -> E + E
E -> E * E
E -> (E)
We refer to E -> E + E as a CFG rule. Each rule consists of a left-hand side and a right-hand side (separated by ->).
The symbols appearing on some left-hand side are non-terminal symbols. Symbols that only appear on right-hand sides are terminal symbols. The left-hand side consists of exactly one non-terminal symbol whereas the right-hand side may consist of an arbitrary sequence of non-terminal and terminal symbols. The right-hand side may also be (the) empty (string).
Rules are applied by replacing (reducing) the left-hand side by the right-hand side. For each CFG, there is a designated starting symbol. For our example, we assume the starting symbol E.

A derivation is a sequence of reduction steps where we assume that we always start with the starting symbol E.

A word is a sequence of terminal symbols. For example, consider 1 + 1 + 0.

A word is part of the language described by the CFG if we find a derivation starting with E that ends with the word.
Consider the following derivation.
E
-> E + E (by applying the CFG rule E -> E + E)
-> E * E + E (by applying the CFG rule E -> E * E on the 'first' E in 'E + E')
-> E * E + 0 (by applying the CFG rule E -> 0 on the 'third' E in 'E * E + E')
-> 1 * E + 0 (and so on)
-> 1 * 1 + 0
We find that 1 * 1 + 0 is a word in the language described by the above CFG.
The rules for numbers are tedious.
E -> N
N -> 0
...
N -> 9
N -> 0N
...
N -> 9N
This could be much simplified if we use regular expressions.
E -> ['0'-'9']+
where we make use of the following short-hands: r+ = r r*, and ['0'-'9'] is a short-form for '0'|...|'9'.
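As a quick sanity check, here is a minimal Go sketch of our own that tests the shorthand ['0'-'9']+ using the standard regexp package (the Go spelling is ^[0-9]+$, anchored to the whole string):

package main

import (
    "fmt"
    "regexp"
)

func main() {
    num := regexp.MustCompile(`^[0-9]+$`)
    fmt.Println(num.MatchString("2024")) // true
    fmt.Println(num.MatchString(""))     // false: + demands at least one digit
    fmt.Println(num.MatchString("1a"))   // false
}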
There are a number of further short-hands and notations to make context-free grammars more accessible. See Extended Backus–Naur form.
We loosely follow EBNF. For brevity, we only consider the numbers 0, 1 and 2.
N ::= 0 | 1 | 2
E ::= N | (E) | E + E | E * E
N and E are non-terminal symbols and 0, 1, 2, (, ), + and * are terminal symbols. The above EBNF rules are simply a more compact representation of the CFG rules we have used so far. For example, in EBNF we may combine the right-hand sides of rules via | (alternative) if the left-hand sides are the same.
The EBNF grammar is ambiguous. Ambiguous means here that for the same word we find two structurally different derivations (two distinct parse trees) using the above rules.
(1)
E
-> E + E
-> N + E
-> 1 + E
-> 1 + E * E
-> 1 + N * E
-> 1 + 2 * E
-> 1 + 2 * N
-> 1 + 2 * 1
(2)
E
-> E * E
-> E + E * E
-> 1 + E * E
-> 1 + 2 * E
-> 1 + 2 * 1
What is the consequence of ambiguity? In case of (1), we interpret "1 + 2 * 1" as 1 + (2 * 1). In case of (2), we interpret "1 + 2 * 1" as (1 + 2) * 1.
Unambiguity of a grammar is therefore important. Otherwise, compiler A (Oracle) might follow interpretation (1) while compiler B (Microsoft) follows interpretation (2). The same program would be interpreted differently!
How to resolve the ambiguity for our example? In general, we assume that * binds tighter than +. This can be enforced via the following grammar.
E ::= E + T | T
T ::= T * F | F
F ::= N | (E)
Recall the above example.
E
-> E + T
-> E + T * F
-> T + T * F
-> F + T * F
-> F + F * F
-> N + F * F
-> 1 + F * F
-> 1 + N * F
-> 1 + 2 * F
-> 1 + 2 * N
-> 1 + 2 * 1
The interpretation is 1 + (2 * 1).
A derivation, i.e. a sequence of CFG rule applications, can be represented as a parse tree. Derivations are formal proofs that verify, for example, that 1 + (2 * 1) is a word in the language denoted by our running CFG. Parse trees are compact representations of such proofs and are more suitable than derivations for processing in a programming language.
From derivations to parse trees:
Each rule consists of a left-hand side (non-terminal) and a right-hand side (mix of terminals and non-terminals).
We view each non-terminal as an intermediate node in a tree. Each terminal is a leaf node.
The left-hand side non-terminal is therefore the parent node. The children are found on the right-hand side.
Consider
E ::= E + T
viewed as a tree
E
/ | \
E + T
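As an aside, such trees are easy to represent in Go. A minimal sketch of our own (the type Tree below is hypothetical and not used by the parsers later on):

package main

import "fmt"

// A node carries a (non-)terminal symbol and its children.
// Terminal symbols are leaves (no children).
type Tree struct {
    Symbol   string
    Children []Tree
}

func main() {
    // The rule E ::= E + T viewed as a tree node.
    t := Tree{Symbol: "E", Children: []Tree{{Symbol: "E"}, {Symbol: "+"}, {Symbol: "T"}}}
    fmt.Println(t) // {E [{E []} {+ []} {T []}]}
}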
Consider the derivation.
E
-> T
-> F
-> (E)
-> (E + T)
-> (T + T)
-> (F + T)
-> (F + F)
-> (N + F)
-> (N + N)
-> (1 + 0)
Represented as a parse tree.
E
|
T
|
F
/ | \
( E )
/ | \
E + T
| |
T F
| |
F N
| |
N 0
|
1
The above parse tree represents the concrete syntax where all details of the original expression remain visible. We refer to the above as a concrete parse tree.
In further transformation steps, not all (concrete) syntactic details are necessary. For example, intermediate steps like "E reduced to T" are not relevant. Furthermore, we can ignore parentheses as they are (implicitly) represented by the structure of the tree.
The thus simplified tree is referred to as the abstract parse tree (aka abstract syntax tree = AST) and captures the abstract syntax of expressions.
+
/ \
1 0
In the AST representation we furthermore use (binary) terminal symbols such as + as intermediate nodes, with their operands as children.
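Using the helper functions number and plus from the parser code further below, this AST is built as plus(number(1), number(0)).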
The parsing problem: given some set of CFG rules and some word, check if the word is part of the language denoted by the CFG. If yes, produce a parse tree.

For arbitrary CFGs, see the Cocke–Younger–Kasami algorithm (cubic complexity).

Top-down parsing, in contrast, is deterministic and runs in linear time. Generally, it requires rewriting the grammar into a suitable form. This is not always possible. So, not every grammar is suitable for top-down parsing.
Recall
N ::= 0 | 1 | 2
E ::= N | (E) | E + E | E * E
This grammar is not suitable for parsing because it is ambiguous. See the discussion above.
The grammar below is unambiguous and accepts the same set of words (language) as the above grammar.
E ::= E + T | T
T ::= T * F | F
F ::= N | (E)
N ::= 0 | 1 | 2
The idea of top-down parsing: starting with the start symbol, repeatedly expand the leftmost non-terminal, where the next input symbol determines which rule to apply.
For the above grammar, this is tricky because of left-recursion.
Consider
E ::= E + T | T
How to predict if we should favor the right-hand side E + T or the right-hand side T?
We might have to backtrack! This is bad because backtracking commonly implies some exponential run-time.
What to do? The idea is to (yet again) rewrite the grammar and remove the left-recursion. In general, a left-recursive rule of the form A ::= A a | b can be replaced by A ::= b A2 and A2 ::= a A2 | <empty>, where A2 is a fresh non-terminal and <empty> denotes the empty right-hand side.
Here is the resulting grammar.
E ::= T E2
E2 ::= + T E2 |
T ::= F T2
T2 ::= * F T2 |
F ::= N | (E)
N ::= 0 | 1 | 2
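For example (our own worked derivation), 1 + 2 is now derived by always expanding the leftmost non-terminal, where the next input symbol determines the rule to pick:

E
-> T E2
-> F T2 E2
-> N T2 E2
-> 1 T2 E2
-> 1 E2              (T2 ::= <empty>)
-> 1 + T E2          (E2 ::= + T E2)
-> 1 + F T2 E2
-> 1 + N T2 E2
-> 1 + 2 T2 E2
-> 1 + 2 E2          (T2 ::= <empty>)
-> 1 + 2             (E2 ::= <empty>)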
How to derive a top-down parser based on this grammar? Each left-hand side corresponds to a function where the function body is defined by the right-hand side. For details, see below.
We give a top-down matcher for our expression language. Instead of a parser (that produces an AST), we only consider a matcher (that says yes/no if the expression is syntactically correct or not).
package main

import "fmt"

// Simple scanner/lexer

// Tokens
const (
    EOS   = 0 // End of string
    ZERO  = 1
    ONE   = 2
    TWO   = 3
    OPEN  = 4
    CLOSE = 5
    PLUS  = 6
    MULT  = 7
)

func scan(s string) (string, int) {
    for {
        switch {
        case len(s) == 0:
            return s, EOS
        case s[0] == '0':
            return s[1:len(s)], ZERO
        case s[0] == '1':
            return s[1:len(s)], ONE
        case s[0] == '2':
            return s[1:len(s)], TWO
        case s[0] == '+':
            return s[1:len(s)], PLUS
        case s[0] == '*':
            return s[1:len(s)], MULT
        case s[0] == '(':
            return s[1:len(s)], OPEN
        case s[0] == ')':
            return s[1:len(s)], CLOSE
        default: // simply skip everything else
            s = s[1:len(s)]
        }
    }
}

// Parser state: the remaining input and the current (look-ahead) token.
type State struct {
    s   *string
    tok int
}

// next advances the state to the next token.
func next(s *State) {
    s2, tok := scan(*s.s)
    s.s = &s2
    s.tok = tok
}
// E ::= T E2
func matchE(s *State) bool {
    if !matchT(s) {
        return false
    }
    if !matchE2(s) {
        return false
    }
    return true
}

// E2 ::= + T E2 |
func matchE2(s *State) bool {
    if s.tok == PLUS {
        next(s)
        if !matchT(s) {
            return false
        }
        if !matchE2(s) {
            return false
        }
    }
    return true
}

// T ::= F T2
func matchT(s *State) bool {
    if !matchF(s) {
        return false
    }
    if !matchT2(s) {
        return false
    }
    return true
}

// T2 ::= * F T2 |
func matchT2(s *State) bool {
    if s.tok == MULT {
        next(s)
        if !matchF(s) {
            return false
        }
        if !matchT2(s) {
            return false
        }
    }
    return true
}

// F ::= N | (E)
func matchF(s *State) bool {
    switch {
    case s.tok == ZERO:
        next(s)
        return true
    case s.tok == ONE:
        next(s)
        return true
    case s.tok == TWO:
        next(s)
        return true
    case s.tok == OPEN:
        next(s)
        if !matchE(s) {
            return false
        }
        if s.tok != CLOSE {
            return false
        }
        next(s)
        return true
    }
    return false
}
// match reports whether the entire input is a syntactically valid expression.
func match(s string) bool {
    st := State{&s, EOS}
    next(&st)
    b := matchE(&st)
    if st.tok == EOS {
        return b
    }
    return false
}

func debug(s string) {
    fmt.Printf("%s", s)
}
func testMatcherBad() {
    fmt.Printf("\n BAD \n")
    fmt.Printf("%t", match(""))
    fmt.Printf("%t", match("1 + "))
    fmt.Printf("%t", match("1 + *"))
    fmt.Printf("%t", match("( 1 "))
    fmt.Printf("%t", match(" )"))
    fmt.Printf("%t", match("1 )"))
    fmt.Printf("%t", match("( 1 ) )"))
}

func testMatcherGood() {
    fmt.Printf("\n GOOD \n")
    fmt.Printf("%t", match("1"))
    fmt.Printf("%t", match("1+0"))
    fmt.Printf("%t", match("1 * 2 "))
    fmt.Printf("%t", match(" (1) "))
    fmt.Printf("%t", match(" (1 * (2)) "))
    fmt.Printf("%t", match(" (1 + 2) * 0 "))
}
func main() {
    testMatcherBad()
    testMatcherGood()
    fmt.Printf("\n")
}
We adjust our matcher to produce an AST. Instead of just a bool, each parsing function additionally returns an AST. There are several ways to represent ASTs. We choose to use Go's type language to define an AST that resembles the algebraic data types found, for example, in Haskell.
The parser can be pretty much straightforwardly derived from the matcher. There is one tricky case.
Consider
E ::= T E2
E2 ::= + T E2 |
The parsing function for E returns an AST. The same applies to E2. To build E2's AST we need to supply the left operand for + T. That is why the parsing function for E2 takes T's AST as input and then returns E2's AST. Details are below.
package main

import (
    "fmt"
    "strconv"
)

// AST
type Exp interface {
    pretty() string
    eval() int
}

type Num int
type Mult [2]Exp
type Plus [2]Exp

// pretty print
func (x Num) pretty() string {
    return strconv.Itoa(int(x))
}

func (e Mult) pretty() string {
    var x string
    x = "("
    x += e[0].pretty()
    x += "*"
    x += e[1].pretty()
    x += ")"
    return x
}

func (e Plus) pretty() string {
    var x string
    x = "("
    x += e[0].pretty()
    x += "+"
    x += e[1].pretty()
    x += ")"
    return x
}

// eval
func (x Num) eval() int {
    return int(x)
}

func (e Mult) eval() int {
    return e[0].eval() * e[1].eval()
}

func (e Plus) eval() int {
    return e[0].eval() + e[1].eval()
}
// Simple scanner/lexer

// Tokens
const (
    EOS   = 0 // End of string
    ZERO  = 1
    ONE   = 2
    TWO   = 3
    OPEN  = 4
    CLOSE = 5
    PLUS  = 6
    MULT  = 7
)

func scan(s string) (string, int) {
    for {
        switch {
        case len(s) == 0:
            return s, EOS
        case s[0] == '0':
            return s[1:len(s)], ZERO
        case s[0] == '1':
            return s[1:len(s)], ONE
        case s[0] == '2':
            return s[1:len(s)], TWO
        case s[0] == '+':
            return s[1:len(s)], PLUS
        case s[0] == '*':
            return s[1:len(s)], MULT
        case s[0] == '(':
            return s[1:len(s)], OPEN
        case s[0] == ')':
            return s[1:len(s)], CLOSE
        default: // simply skip everything else
            s = s[1:len(s)]
        }
    }
}

type State struct {
    s   *string
    tok int
}

func next(s *State) {
    s2, tok := scan(*s.s)
    s.s = &s2
    s.tok = tok
}
// E ::= T E2
func parseE(s *State) (bool, Exp) {
    b, e := parseT(s)
    if !b {
        return false, e
    }
    return parseE2(s, e)
}

// E2 ::= + T E2 |
// Takes the AST for the left operand (parsed so far) and extends it.
func parseE2(s *State, e Exp) (bool, Exp) {
    if s.tok == PLUS {
        next(s)
        b, f := parseT(s)
        if !b {
            return false, e
        }
        t := (Plus)([2]Exp{e, f})
        return parseE2(s, t)
    }
    return true, e
}

// T ::= F T2
func parseT(s *State) (bool, Exp) {
    b, e := parseF(s)
    if !b {
        return false, e
    }
    return parseT2(s, e)
}

// T2 ::= * F T2 |
func parseT2(s *State, e Exp) (bool, Exp) {
    if s.tok == MULT {
        next(s)
        b, f := parseF(s)
        if !b {
            return false, e
        }
        t := (Mult)([2]Exp{e, f})
        return parseT2(s, t)
    }
    return true, e
}

// F ::= N | (E)
func parseF(s *State) (bool, Exp) {
    switch {
    case s.tok == ZERO:
        next(s)
        return true, (Num)(0)
    case s.tok == ONE:
        next(s)
        return true, (Num)(1)
    case s.tok == TWO:
        next(s)
        return true, (Num)(2)
    case s.tok == OPEN:
        next(s)
        b, e := parseE(s)
        if !b {
            return false, e
        }
        if s.tok != CLOSE {
            return false, e
        }
        next(s)
        return true, e
    }
    return false, (Num)(0)
}
func parse(s string) Exp {
    st := State{&s, EOS}
    next(&st)
    _, e := parseE(&st)
    if st.tok == EOS {
        return e
    }
    return (Num)(0) // dummy value
}

func debug(s string) {
    fmt.Printf("%s", s)
}

func test(s string) {
    e := parse(s)
    fmt.Printf("\n %s", e.pretty())
    fmt.Printf("\n %d", e.eval())
}
func testParserGood() {
    fmt.Printf("\n GOOD \n")
    test("1")
    test("1+0")
    test("1 * 2 ")
    test(" (1) ")
    test(" (1 * (2)) ")
    test(" (1 + 2) * 0 ")
}
// Helper functions to build ASTs by hand

func number(x int) Exp {
    return Num(x)
}

func plus(x, y Exp) Exp {
    return (Plus)([2]Exp{x, y})
    // The type Plus is defined as the two-element array consisting of Exp elements.
    // Plus and [2]Exp are isomorphic but different types.
    // We first build the AST value [2]Exp{x, y}.
    // Then we convert this value (of type [2]Exp) into a value of type Plus.
}

func mult(x, y Exp) Exp {
    return (Mult)([2]Exp{x, y})
}

func examplesAST() {
    ast1 := plus(mult(number(1), number(2)), number(0))
    fmt.Printf("%s", ast1.pretty())
}
func main() {
    examplesAST()
    fmt.Printf("\n")
    testParserGood()
}
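For reference (derived by reading the code, not captured from an actual run), testParserGood prints the following pretty-printed ASTs and values:

input            pretty       eval
1                1            1
1+0              (1+0)        1
1 * 2            (1*2)        2
(1)              1            1
(1 * (2))        (1*2)        2
(1 + 2) * 0      ((1+2)*0)    0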
A parser is a function which takes a string and yields a result and the remaining string.

type Parser func(s string) (interface{}, string, bool)

where interface{} represents the parsing result (e.g. some abstract syntax tree) and the returned string represents the remaining input. As parsing may fail, we also return a Boolean value to represent success or failure.
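Aside: interface{} gives up static type information about parse results. With Go generics (Go 1.18 and later) one could, as a variation on the design below, give parsers a precise result type. A sketch (our own, not used in the following):

// A hypothetical generic variant of the Parser type.
type Parser[T any] func(s string) (T, string, bool)

We stick to interface{} here to keep the method-based combinator API: Go methods cannot introduce additional type parameters, which a combinator like seq (whose result type differs from its input's) would require.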
Application of a parser on some input.
func parse(f Parser, s string) (interface{}, string, bool) {
    return f(s)
}
The epsilon combinator which parses the empty string. We leave the input string untouched and report as a result the empty string.
func eps() Parser {
    return func(s string) (interface{}, string, bool) {
        return "", s, true
    }
}
A parser to accept a specific character (item).
func item(x byte) Parser {
    return func(s string) (interface{}, string, bool) {
        if len(s) == 0 {
            return 0, s, false
        } else if s[0] == x {
            return s[0], s[1:len(s)], true
        } else {
            return 0, s, false
        }
    }
}
We build new parsers by composing existing parsers.
func (p1 Parser) alt(p2 Parser) Parser {
    return func(s string) (interface{}, string, bool) {
        v, rest, o := parse(p1, s)
        if !o {
            return parse(p2, s)
        } else {
            return v, rest, o
        }
    }
}

func (p1 Parser) seq(p2 func(interface{}) Parser) Parser {
    return func(s string) (interface{}, string, bool) {
        v, rest, o := parse(p1, s)
        if !o {
            return v, rest, o
        } else {
            return parse(p2(v), rest)
        }
    }
}

func (p1 Parser) conc(p2 Parser) Parser {
    return p1.seq(func(v interface{}) Parser { return p2 })
}
alt is left-biased: if the first argument (parser) fails, we try the alternative.

seq runs two parsers in sequence. Recall that as a side product of parsing we usually expect to obtain a parse tree. With parser combinators the result obtained can be arbitrary. Check the type of seq!
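To illustrate seq, here is a small sketch of our own (pairAB is a hypothetical example parser, not part of the library): it combines the results of two item parsers and returns the pair of matched characters.

// Accepts "ab" and yields the pair of both characters as its result.
func pairAB() Parser {
    return item('a').seq(func(v interface{}) Parser {
        return item('b').seq(func(w interface{}) Parser {
            // A parser that consumes nothing and returns the combined result.
            return func(s string) (interface{}, string, bool) {
                return [2]interface{}{v, w}, s, true
            }
        })
    })
}

For example, parse(pairAB(), "ab") yields the pair of the two bytes, the empty remaining input, and true.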
conc is a specialized combinator where we simply concatenate the two parsers, ignoring the result obtained from the first parser.
As we largely ignore here the result produced by parsing, we write a matcher function which checks if our parser matches some input string.
func matcher(p Parser, s string) bool {
    _, rest, o := parse(p, s)
    return rest == "" && o
}
ex0 := item('c')
ex1 := item('a').alt(item('b'))
ex2 := ex0.conc(ex1)
Most parser combinator libraries provide further combinators so that we can even deal with left-recursive, ambiguous, and beyond context-free grammars. For details see https://en.wikipedia.org/wiki/Parser_combinator.
In fact, some of these more expressive combinators can be directly encoded in terms of the host language. In the following, we show how to encode Kleene star.
Consider the regular expression
a* c
which can also be defined in terms of the following CFG productions
A -> a A | c
Observation:
Our host language Go supports recursive functions.
Encode the Kleene star via a recursive function yielding a parser for the above grammar.
func acceptAsC() Parser {
    return item('c').alt(item('a').seq(func(v interface{}) Parser { return acceptAsC() }))
}
Some points to note.
Symbol c tells us to stop, hence it is tried first. Recall that alt is left-biased.
We can NOT write (the more direct) combinator program
func acceptAsC2() Parser {
    return item('c').alt(item('a').conc(acceptAsC2()))
}
Go is a strictly evaluated language and the above would immediately lead to a stack overflow. Hence, we need to 'hide' the recursive call within a (lambda) function.
Aside. In the lazy language Haskell, the more direct encoding is possible because program parts are lazily evaluated (i.e. only when needed).
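Alternatively, we can stay close to the direct style by adding a small helper that defers parser construction until input is actually consumed. This lazy combinator is our own sketch and not part of the library below:

// lazy turns a parser-producing function into a parser, delaying the
// call to mk until the parser is applied. This breaks the eager recursion.
func lazy(mk func() Parser) Parser {
    return func(s string) (interface{}, string, bool) {
        return parse(mk(), s)
    }
}

// With lazy, the direct style of acceptAsC2 becomes safe:
func acceptAsC3() Parser {
    return item('c').alt(item('a').conc(lazy(acceptAsC3)))
}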
Parser combinators are a domain-specific language embedded into some host language (here Go).
Our host language Go is strongly typed, so we can statically guarantee that our parsers are "well-formed" (a parser is a composition of existing parsers).
We make use of recursion available in the host language to extend the expressiveness of our parser combinator language (see the encoding of the Kleene star).
DSL (external):
M ::= 'c' | 'a' M
EDSL:
func acceptAsC2() Parser {
    return item('c').alt(item('a').conc(acceptAsC2()))
}
package main

// Parser combinators https://en.wikipedia.org/wiki/Parser_combinator

import "fmt"

type Parser func(s string) (interface{}, string, bool)

func parse(f Parser, s string) (interface{}, string, bool) {
    return f(s)
}

func eps() Parser {
    return func(s string) (interface{}, string, bool) {
        return "", s, true
    }
}

func item(x byte) Parser {
    return func(s string) (interface{}, string, bool) {
        if len(s) == 0 {
            return 0, s, false
        } else if s[0] == x {
            return s[0], s[1:len(s)], true
        } else {
            return 0, s, false
        }
    }
}

func (p1 Parser) alt(p2 Parser) Parser {
    return func(s string) (interface{}, string, bool) {
        v, rest, o := parse(p1, s)
        if !o {
            return parse(p2, s)
        } else {
            return v, rest, o
        }
    }
}

func (p1 Parser) seq(p2 func(interface{}) Parser) Parser {
    return func(s string) (interface{}, string, bool) {
        v, rest, o := parse(p1, s)
        if !o {
            return v, rest, o
        } else {
            return parse(p2(v), rest)
        }
    }
}

func (p1 Parser) conc(p2 Parser) Parser {
    return p1.seq(func(v interface{}) Parser { return p2 })
}

func matcher(p Parser, s string) bool {
    _, rest, o := parse(p, s)
    return rest == "" && o
}

func acceptAsC() Parser {
    return item('c').alt(item('a').seq(func(v interface{}) Parser { return acceptAsC() }))
}

// acceptAsC2 recurses forever at construction time; see the discussion above.
func acceptAsC2() Parser {
    return item('c').alt(item('a').conc(acceptAsC2()))
}
func main() {
    fmt.Printf("%t", matcher(acceptAsC(), "ac"))
    // fmt.Printf("%t", matcher(acceptAsC2(), "ac")) // stack overflow!
    ex0 := item('c')
    fmt.Printf("%t", matcher(ex0, "c"))
    fmt.Printf("%t", matcher(ex0, "cd"))
    ex1 := item('a').alt(item('b'))
    fmt.Printf("%t", matcher(ex1, "a"))
    fmt.Printf("%t", matcher(ex1, "b"))
    fmt.Printf("%t", matcher(ex1, "c"))
    fmt.Printf("%t", matcher(ex1, "ab"))
    ex2 := ex0.conc(ex1)
    fmt.Printf("%t", matcher(ex2, "cb"))
    fmt.Printf("%t", matcher(ex2, "ad"))
}
yacc = yet another compiler compiler
lex = lexer (tokenizer) based on regular expressions
bison/flex GNU variants of yacc/lex
yacc/lex require the following inputs
lexer file (.l) defines tokens (based on regular expressions)
parser file (.y) defines CFG rules (must be suitable for bottom-up parsing)
Outputs are C programs that represent the (bottom-up) parser.
// calc2.l
%{
#include "calc2.tab.h"
%}
%%
"+" {return(PLUS);}
"*" {return(TIMES);}
"(" {return(LPAREN);}
")" {return(RPAREN);}
[\n] {return(CR);}
[0-9]+ {yylval=atoi(yytext); return(INT);}
[\t ] ;
%%
// calc2.y
%{
/*
All of the below productions that do not have associated
actions are using the DEFAULT action -- $$ = $1
*/
%}
%token PLUS TIMES INT CR RPAREN LPAREN
%%
lines : lines line
| line
;
line : expr CR {printf("= %d\n",$1); }
;
expr : expr PLUS term {$$ = $1 + $3; }
| term
;
term : term TIMES factor {$$ = $1 * $3; }
| factor
;
factor : LPAREN expr RPAREN {$$ = $2;}
| INT
;
%%
Productions = CFG rules
For each CFG rule there can be a semantic rule
Semantic rules use attributes (and may make use of C code)
Attributes are connected to AST nodes where the shape of the AST corresponds to the shape of the CFG
Consider
expr : expr PLUS term {$$ = $1 + $3; }
$1 is an attribute connected to expr (on the right-hand side)

$$ is an attribute connected to expr (on the left-hand side)

yacc assumes synthesized attributes. These are attributes that can be computed from the children's attributes. Consider $$ = $1 + $3.
Synthesized attributes float up the AST whereas inherited attributes are pushed down the AST.
Computation of synthesized attributes matches the bottom-up parsing strategy.
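To relate this to our Go code: the eval methods on the AST in the parser section compute exactly such synthesized attributes, where each node's value is computed bottom-up from the values of its children.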
Makefile to build the parser:

# Makefile
NAME = calc2

$(NAME): $(NAME).tab.c lex.yy.c
	gcc $(NAME).tab.c lex.yy.c -o $(NAME) -ly -ll

$(NAME).tab.c: $(NAME).y
	bison -vd $(NAME).y

lex.yy.c: $(NAME).l
	flex $(NAME).l