Apr 27, 2015 - Finishing Grammars

Let's get right to it.

Terminals

The simplest grammar rule is a terminal string. (This code comes from the parser namespace.)

(deftype parser-terminal [term-str])

(defn term [term-str]
  (fr/free (parser-terminal term-str)))

Let's extend parser-terminal to implement ebnf.

(extend-type parser-terminal
  Make-EBNF
  (ebnf [terminal]
    (EBNF (str "'" (.term-str terminal) "'") {})))

As you can see, when you understand how free and evaluate work, it's super easy to add other types of values to be included in the tree data structure.
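To make the idea concrete outside Toccata, here's a minimal Python sketch of the same shape (all names here are hypothetical, not Toccata's API): a free wrapper holds a plain value, and an interpreter function turns wrapped terminals into EBNF text.

```python
# Hypothetical Python sketch: a Free node wraps a plain value, and an
# interpreter function renders wrapped terminals as EBNF text.

class Free:
    """Wraps any plain value so it can sit in the grammar tree."""
    def __init__(self, value):
        self.value = value

    def evaluate(self, interpret):
        # Hand the wrapped value to the interpreter.
        return interpret(self.value)

class Terminal:
    def __init__(self, term_str):
        self.term_str = term_str

def term(term_str):
    return Free(Terminal(term_str))

def ebnf(node):
    # Render a terminal as a quoted string, with no named rules yet.
    if isinstance(node, Terminal):
        return ("'" + node.term_str + "'", {})
    raise TypeError(node)

print(term("howdy").evaluate(ebnf))   # → ("'howdy'", {})
```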

Repetition

There are two forms of rule repetition: one-or-more and none-or-more.

(deftype repeat-rule [rule])

(defn one-or-more [rule]
  (fr/free (repeat-rule rule)))

(deftype none-or-more-rule [rule])

(defn none-or-more [rule]
  (fr/free (none-or-more-rule rule)))

Extending our EBNF generator to cover them is pretty straightforward.

(extend-type p/repeat-rule
  Make-EBNF
  (ebnf [r]
    (let [rule-body (fr/evaluate (.rule r) ebnf)]
      (EBNF (str (.ebnf-str rule-body) ", { " (.ebnf-str rule-body) " }")
            (.rules rule-body)))))

(extend-type p/none-or-more-rule
  Make-EBNF
  (ebnf [r]
    (let [rule-body (fr/evaluate (.rule r) ebnf)]
      (EBNF (str "{ " (.ebnf-str rule-body) " }")
            (.rules rule-body)))))
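The two renderings differ only in whether the body appears once before the repetition braces. A hypothetical Python sketch of just the string-building part:

```python
# Hypothetical sketch: given a rendered rule body, produce the EBNF text
# for the two repetition forms, mirroring the extensions above.

def one_or_more_ebnf(body: str) -> str:
    # "at least one" = the body once, then any number of repeats
    return body + ", { " + body + " }"

def none_or_more_ebnf(body: str) -> str:
    # "zero or more" = just the repetition braces
    return "{ " + body + " }"

print(one_or_more_ebnf("digit"))    # → digit, { digit }
print(none_or_more_ebnf("digit"))   # → { digit }
```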

Options

Presented without further comment:

(deftype optional-rule [rule])

(defn optional [rule]
  (fr/free (optional-rule rule)))


(extend-type p/optional-rule
  Make-EBNF
  (ebnf [r]
    (let [rule-body (fr/evaluate (.rule r) ebnf)]
      (EBNF (str "[ " (.ebnf-str rule-body) " ]")
            (.rules rule-body)))))

Cherry on top

(main [_]
      (let [rules (seq (.rules (fr/evaluate (expression) ebnf)))]
        (map rules (fn [rule]
                     (println (first rule) "=" (second rule) ";")))))

We evaluate the grammar, pull out the rules, and make a list of them. Then we map over the list, printing each rule's name and body. When run, it produces this output:

letter = 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j'
| 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u'
| 'v' | 'w' | 'x' | 'y' | 'z'
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
symbol = letter, { digit | letter }
number = [ '-' ], digit, { digit }
expression = '( ', { symbol | number | expression }, ' )'

Full code for the EBNF generator is here.

But is it practical?

You might be thinking "That's a lot of work. What's the payoff?" Toccata's grammar and parser are specified using exactly this code. The formal grammar is here. And the official EBNF is here. It's a little more complicated for performance reasons, but Toccata has an easily consumed formal specification for determining which strings constitute a valid expression. As an enlightening exercise, see if you can find the formal grammars for other popular programming languages. While they do exist, too often the only spec is whatever the language's parser accepts. That can lead to weird corner cases and makes a language nigh-on impossible to port correctly. You could take Toccata's EBNF and (almost) drop it into Instaparse to get a parser for Toccata. And you could just as easily generate as many strings as you wanted to test that parser. There aren't many other languages you can say that about.

And if you're curious about the code that generates the full Toccata EBNF, it's here.

Apr 25, 2015 - Interpreting Grammars

Picking up the trail

We left off last time having created a simple grammar.

(defn letter []
  (p/rule "letter"
          (p/one-of "abcdefghijklmnopqrstuvwxyz")))

(defn digit []
  (p/rule "digit"
          (p/one-of "0123456789")))

(defn number []
  (p/rule "number"
          (apply-to str
                    (p/optional (p/term "-"))
                    (p/one-or-more (digit)))))

(defn symbol []
  (p/rule "symbol"
          (apply-to str
                    (letter)
                    (p/none-or-more (comp (digit) (letter))))))

(defn sub-expression []
  (p/rule "expression"
          (p/term "")))

(defn expression []
  (p/rule "expression"
          (apply-to str
                    (p/term "( ")
                    (p/none-or-more (comp (symbol) (number)
                                          (sub-expression)))
                    (p/term " )"))))

Now let's do something with the data structure that expression creates.

EBNF

If the grammar is just a data structure, why didn't we print it? Well, to print something is to interpret it as a string of characters. But there's no common way of interpreting free values, which is what a grammar is.

One way to interpret grammars (and to specify them) is using Extended Backus-Naur Form. This notation is a succinct representation of grammars, so let's see about converting our grammar into EBNF.

EBNF lets you define rules that look like this:

letter = 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z';
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';
symbol = letter, { digit | letter };
number = [ '-' ], digit, { digit };
expression = '( ', { symbol | number | expression }, ' )';

Which just so happens to correspond to our grammar. You'll see that rule names can be used in other rules' definitions. Optional items are enclosed in [...]. Alternatives are linked with |, sequences are linked with , and repetitions are enclosed in {...}. Terminal strings are enclosed in '. So how do we go about converting the result of a call to expression to EBNF?
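As a small aside, the notation reads directly as a recognizer. Here's a hypothetical Python sketch that checks a string against the number rule, number = [ '-' ], digit, { digit }, by consuming characters left to right:

```python
# Hypothetical sketch: check a string against
#   number = [ '-' ], digit, { digit }
# by consuming characters left to right.

def match_number(s: str) -> bool:
    i = 0
    if i < len(s) and s[i] == '-':          # [ '-' ]  -- optional sign
        i += 1
    if i == len(s) or not s[i].isdigit():   # digit    -- at least one required
        return False
    while i < len(s) and s[i].isdigit():    # { digit } -- any further digits
        i += 1
    return i == len(s)                      # the whole string must be consumed

print(match_number("-42"))   # → True
print(match_number("-"))     # → False
```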

Evaluation

A few posts back, I talked about how to use the evaluate function from the core/free.toc module to interpret a free value and that's what we're going to do here.

The core concept is that evaluate walks the tree created using free, apply-to and comp. Anytime it encounters a value that isn't free, it will call a function to convert that value to the type we want for our final value. That type must also specify implementations of apply* and comp*.

For our grammar, our final value will be a hash map of rule names to rule bodies. As usual, there's a wrinkle or two that makes things a wee bit more complicated. But not too much.
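Before digging into the real code, here's a hypothetical Python sketch of the core concept: evaluate recurses through a tree of wrapped leaves, sequence nodes (apply-to), and alternative nodes (comp), handing leaves to the interpreter and combining the results. The string joins stand in for apply* and comp*.

```python
# Hypothetical sketch of evaluate: walk a tree of Free / Apply / Alt nodes.
# Leaves are handed to the interpreter; interior nodes combine the results
# with the target type's apply* / comp* analogues.

class Free:
    def __init__(self, value): self.value = value

class Apply:          # built by apply-to: a sequence of sub-trees
    def __init__(self, *subs): self.subs = subs

class Alt:            # built by comp: alternative sub-trees
    def __init__(self, *subs): self.subs = subs

def evaluate(tree, interpret):
    if isinstance(tree, Free):
        return interpret(tree.value)
    parts = [evaluate(s, interpret) for s in tree.subs]
    joiner = ", " if isinstance(tree, Apply) else " | "
    return joiner.join(parts)    # stand-in for apply* / comp*

tree = Apply(Free("a"), Alt(Free("b"), Free("c")))
print(evaluate(tree, lambda v: "'" + v + "'"))   # → 'a', 'b' | 'c'
```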

Starting at the top

Let's start by looking at expression

(defn expression []
  (p/rule "expression"
          (apply-to str
                    (p/term "( ")
                    (p/none-or-more (comp (symbol) (number)
                                          (sub-expression)))
                    (p/term " )"))))

It creates a rule value with two parameters: the string "expression", which is the name of the rule, and the grammar for an expression.

An expression is defined as an opening paren ( followed by zero or more items that are one of a symbol, number or sub-expression, and then a closing paren ). So let's look at the rule function from the parser namespace.

(deftype parser-rule [name grammar])

(defn rule [name grammar]
  (fr/free (parser-rule name grammar)))

We see that it creates a free value that wraps a value of type parser-rule, which holds the rule name and the rule body grammar.

So when we write

(fr/evaluate (expression) ebnf)

ebnf must be a function that, at a minimum, takes a parser-rule value and produces an EBNF rule. Here's the code for evaluate.

(deftype free [v]
  FreeEval
  (evaluate [free-val eval-free]
    (eval-free v)))

Going deeper

The interesting thing to see is that the grammar symbol inside the parser-rule value is itself a free value that wraps a grammar tree data structure. In the case of expression, this value was created by a call to apply-to. So, we're going to recurse down the tree by calling evaluate.

ebnf is going to be called for the term and the none-or-more values, so it's going to have to do different things depending on the type of the parameter it's called with. So it will need to be a protocol function.

(defprotocol Make-EBNF
  (ebnf [grammar]))

And parser-rule will need to be extended to implement ebnf.

(extend-type p/parser-rule
  Make-EBNF
  (ebnf [r]
    (let [rule-body (fr/evaluate (.grammar r) ebnf)]
      (EBNF (.name r)
            (assoc (.rules rule-body)
              (.name r) (.ebnf-str rule-body))))))
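In rough Python terms (hypothetical names, not Toccata's API), the parser-rule case evaluates the body, records it in the rules map under the rule's name, and returns just the name as the rule's own EBNF text, so other rules can refer to it:

```python
# Hypothetical sketch of the parser-rule case: register the evaluated body
# under the rule's name; the rule's own EBNF fragment is just its name.

def rule_ebnf(name, body_str, body_rules):
    rules = dict(body_rules)       # keep any rules collected in the body
    rules[name] = body_str         # add this rule's own definition
    return (name, rules)

print(rule_ebnf("digit", "'0' | '1'", {}))
# → ('digit', {'digit': "'0' | '1'"})
```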

Getting a result

What's up with that call to EBNF? Well, it's a type constructor.

(deftype EBNF [ebnf-str rules]
  Applicative
  (apply* [_ vs]
    (EBNF (apply str (interpose (map vs .ebnf-str) ", "))
          (apply comp (map vs .rules))))

  Monoid
  (comp* [v vs]
    (let [vs (cons v vs)]
      (EBNF (apply str (interpose (map vs .ebnf-str) " | "))
            (apply comp (map vs .rules))))))

As you can see, it implements apply* and comp* to compose ebnf fragments either sequentially or as alternatives. So there we have the type of data that will ultimately hold our completed EBNF grammar in the rules field of an EBNF value.
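A hypothetical Python analogue of that type makes the two compositions easy to see: sequencing joins the fragments with ", ", alternation joins them with " | ", and both merge the accumulated rule maps.

```python
# Hypothetical Python analogue of the EBNF type: sequence() stands in for
# apply*, alternatives() stands in for comp*, and both merge the rule maps.

class EBNF:
    def __init__(self, ebnf_str, rules):
        self.ebnf_str = ebnf_str
        self.rules = rules

    @staticmethod
    def _merge(vs):
        merged = {}
        for v in vs:
            merged.update(v.rules)
        return merged

    @staticmethod
    def sequence(vs):                    # analogue of apply*
        return EBNF(", ".join(v.ebnf_str for v in vs), EBNF._merge(vs))

    @staticmethod
    def alternatives(vs):                # analogue of comp*
        return EBNF(" | ".join(v.ebnf_str for v in vs), EBNF._merge(vs))

a = EBNF("'a'", {})
b = EBNF("'b'", {})
print(EBNF.sequence([a, b]).ebnf_str)       # → 'a', 'b'
print(EBNF.alternatives([a, b]).ebnf_str)   # → 'a' | 'b'
```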

But why?

If all this seems overly complicated, realize that we have totally decomposed the complexity of specifying and interpreting a grammar. "Separation of concerns" is often extolled as something to strive for in order to keep code maintainable and flexible. For instance, the code for free and evaluate is in a core Toccata namespace, so there probably aren't any bugs in it. No need to reimplement tree construction and walking. And the various kinds of grammar combinators are self-contained data types. Adding new ones is almost trivial and has no impact on the existing types. The complete code for our simple EBNF generator is here.

While there's a lot of verbiage to explain it, the code itself is pretty concise.

Finally, I'll give a peek down the road. Just printing an EBNF of a grammar isn't all that useful. What you really want is a function that will parse a string according to a particular grammar. That's as easy as

(fr/evaluate (expression) recursive-descent)

which generates a recursive descent parser for our little grammar. And you could generate 1000 strings to test the parser

(take 1000 (fr/evaluate (expression) generate-string))
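To get a feel for the generate-string idea, here's a hypothetical Python sketch that walks a tiny (non-recursive) grammar and enumerates every string it accepts; a real generator would sample lazily and bound recursion.

```python
# Hypothetical sketch of the generate-string idea: walk a small grammar
# given as nested tuples and enumerate the strings it accepts.

def generate(rule):
    kind = rule[0]
    if kind == "term":                        # a literal string
        return [rule[1]]
    if kind == "seq":                         # apply-to: cross-product of parts
        outs = [""]
        for sub in rule[1:]:
            outs = [a + b for a in outs for b in generate(sub)]
        return outs
    if kind == "alt":                         # comp: union of alternatives
        return [s for sub in rule[1:] for s in generate(sub)]
    if kind == "optional":                    # the rule, or the empty string
        return [""] + generate(rule[1])
    raise ValueError(kind)

# number-like rule: [ '-' ] followed by a digit (just two digits here)
number = ("seq", ("optional", ("term", "-")),
                 ("alt", ("term", "0"), ("term", "1")))
print(generate(number))   # → ['0', '1', '-0', '-1']
```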

Hold that thought

This post is already over-limit, so we'll look at the other rule combinators next time.

Apr 23, 2015 - Grammars

Picking up the trail

If you've been following along, you know that in a previous post, I talked about generating an SVG image given a tree data structure created using free values. (if not, check the archives)

We're going to pick up that thread again this time.

Comp'ing

We've seen that the free values implement the wrap and apply* functions of the Applicative protocol. This let us use apply-to to build tree-shaped data structures that can then be evaluated. Well, the fun doesn't stop there. free values also implement the comp* function of the Monoid protocol.

This means that two free values can be combined to form another, larger, free value using comp. So now we have two ways to combine free values.

Grammars

Which is very convenient, because many, many problems can be expressed as trees with two different kinds of branching. One that is used in Toccata is for the construction of the formal grammar for the language.

A grammar is a specification that tells you exactly what strings are valid for a language. They can be specified in many ways. If you look at this file, you'll find Toccata's grammar specified as Toccata code. This code uses functions from the parser namespace to define the grammar. Grammars are built from rules which are combined and modified in different ways. Checking the validity of a string against a grammar rule is called 'parsing'.

The simplest rule is defined with the term function:

(defn term [term-str]
  (fr/free (parser-terminal term-str)))

As you can see, it takes a string and returns a parser-terminal value wrapped in a free value. A terminal rule says that the only valid string in the grammar is term-str.

(term "howdy")

Sequences

The first rule composition we'll look at is combining rules so that one follows another. This is done using apply-to.

(apply-to str (term "a") (term "b"))

This rule says that a valid string is an "a" followed by a "b", or "ab".

Alternatives

The second form of composition is specifying alternatives using comp.

(comp (term "a") (term "b"))

This rule says that there are two valid strings, "a" or "b".

Options

A rule can be made optional as well.

(apply-to str (optional (term "a")) (term "b"))

means that valid strings are "b" and "ab".

Repetition

There are two ways to indicate that a rule may be applied a number of times:

(one-or-more (term "a"))

and

(none-or-more (term "a"))

The first says that a string of one or more "a"'s is valid. The second says the same thing, but also that the empty string is valid.
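To make these combinators concrete, here's a hypothetical Python sketch (not Toccata's implementation) where each combinator returns a matcher that, given a string and a position, returns the positions it can reach; a string is valid when some parse consumes the whole input.

```python
# Hypothetical sketch: each combinator returns a matcher (text, i) -> list
# of positions reachable after matching. A string is valid when some parse
# consumes the whole input.

def term(s):
    return lambda text, i: [i + len(s)] if text.startswith(s, i) else []

def seq(*rules):          # apply-to: one rule after another
    def match(text, i):
        positions = [i]
        for r in rules:
            positions = [j for p in positions for j in r(text, p)]
        return positions
    return match

def alt(*rules):          # comp: any one of the rules
    return lambda text, i: [j for r in rules for j in r(text, i)]

def optional(rule):       # the rule, or nothing
    return lambda text, i: [i] + rule(text, i)

def none_or_more(rule):   # zero or more repetitions
    def match(text, i):
        out, frontier = [i], [i]
        while frontier:
            frontier = [j for p in frontier
                          for j in rule(text, p) if j not in out]
            out += frontier
        return out
    return match

def one_or_more(rule):    # at least one, then zero or more
    return seq(rule, none_or_more(rule))

def valid(rule, text):
    return len(text) in rule(text, 0)

ab = seq(optional(term("a")), term("b"))   # [ 'a' ], 'b'
print(valid(ab, "ab"), valid(ab, "b"), valid(ab, "a"))   # → True True False
```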

Simple Grammar

With that brief explanation, hopefully the following simple grammar makes sense.

(defn letter []
  (p/rule "letter"
          (p/one-of "abcdefghijklmnopqrstuvwxyz")))

(defn digit []
  (p/rule "digit"
          (p/one-of "0123456789")))

(defn number []
  (p/rule "number"
          (apply-to str
                    (p/optional (p/term "-"))
                    (p/one-or-more (digit)))))

(defn symbol []
  (p/rule "symbol"
          (apply-to str
                    (letter)
                    (p/none-or-more (comp (digit) (letter))))))

(defn sub-expression []
  (p/rule "expression"
          (p/term "")))

(defn expression []
  (p/rule "expression"
          (apply-to str
                    (p/term "( ")
                    (p/none-or-more (comp (symbol) (number)
                                          (sub-expression)))
                    (p/term " )"))))

The one-of rule says that any one of the characters of the given string is valid. It's a shorthand way of using comp. The top-level rule of the grammar is expression. Because expressions can be nested inside other expressions, and given the current limitations of Toccata, that recursive structure has to be broken up with the sub-expression rule. So, a valid string for this grammar is

"(abc (inside 9 (() more inside)) 123)"
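Since one-of is just sugar for alternatives over each character, its EBNF rendering can be sketched in one line of hypothetical Python:

```python
# Hypothetical sketch: one-of is shorthand for alternatives over the
# individual characters, rendered here straight to EBNF text.

def one_of_ebnf(chars: str) -> str:
    return " | ".join("'" + c + "'" for c in chars)

print(one_of_ebnf("abc"))   # → 'a' | 'b' | 'c'
```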

Data Only

Thus far, we've only created a data structure and bound it to the symbol expression. To give meaning to that tree structure, we have to use the evaluate function from the core/free.toc namespace. There's an amazing amount of power that we'll unlock next time. For now, consider that the core/parser.toc namespace provides a DSL for describing grammars. And that the interpretation of that structure is totally separated from its creation.