Long time, no see

Well, it's been longer than I would have liked since the last post. As usual in this industry, things always take longer than you expect. I'll not bore you with the details. So let's get on to the cool stuff.


I've been writing about parsing for the last couple of posts. This always involves building up a grammar of some sort from smaller grammars and then passing the full grammar to a function to produce a parser. The point I've been trying to make, and will drive home in this post, is that the grammar itself is just a data structure. Deriving a parser from it is just one way to interpret it. Here are a couple of others.

But first, a digression.

It's convenient to have a way to write out a grammar in a concise, readable way that clearly specifies what strings are valid. The EBNF is a very useful tool to have. Click on the link to read up on it.

One thing you can do with a grammar in Toccata is generate the EBNF for it. This can be done with this library. You can take any grammar and pass it to the produce-ebnf function and it will spit one out. Using this library, the EBNF for the JSON grammar we looked at last time is

whitespace = { ' ' | '\t' | '\r' | '\n' | '\f' };
escaped chars = '\\', '\"' | '\\' | '/' | 'b' | 'f' | 'n' | 'r' | 't';
string = whitespace, '\"', { escaped chars | not '\"' }, '\"';
digit = '0' - '9';
integer = whitespace, ['-'], digit, { digit };
object = whitespace, '{',
         [{ string, whitespace, ':', whitespace, value, whitespace, ',', whitespace },
          string, whitespace, ':', whitespace, value],
         whitespace, '}';
array = whitespace, '[', [{ value, whitespace, ',', whitespace }, value], whitespace, ']';
value = string |
        integer |
        object |
        array |
        whitespace, 'true' |
        whitespace, 'false' |
        whitespace, 'null';

I'd originally intended to explain how the EBNF generator works, but as I reviewed it, I realized I would have to write about 5 posts and I have other things I want to get to. Maybe down the road, I'll come back to that. But the important thing I want to communicate is that a grammar in Toccata is just a data structure and they can be used in many different ways. For instance, here is the EBNF for the Toccata language itself. (I just realized I should put this in a documentation directory.) This document is not the specification of the parser, that's here. But it's generated automatically from the specification, so it's always accurate as of the time it was generated. How many other languages do you know that have an EBNF you can examine to see if an expression is valid? Most of the time, whatever the parser accepts is what's considered valid and there are almost always some weird corner cases.

A picture is worth ...

I forget when I discovered it, but this article was written in 2007 and deals with parsing, regular expressions and Finite State Automata. I can not recommend it, and the three that follow it, highly enough. For present purposes, I want to draw attention to the diagrams scattered throughout. These show the different states a parser can go through as it reads each character of the input string. These diagrams can be very illuminating in a way the textual representation of the EBNF isn't. This is similar to the Syntax Diagrams on the JSON.org home page. Except the Syntax Diagrams and EBNF show the structure of the grammar. The state machine graphs show the states of the parser. Wouldn't it be nice to be able to generate these graphs automatically?.

Using that library, here's the graph for the JSON grammar we've been looking at.


This did require some updates to the json.toc grammar. How it works is the 'grammar-graph' library converts the grammar to a Graphviz DOT file. This file is then fed into the dot command to produce an SVG or PNG image. Slick!

The Toccata code is located in the 'examples' directory alongside the 'json.toc' file:

Parse some sample JSON strings
Generate EBNF
Generate a state graph
(To produce an SVG file, "run examples/json-graph.toc | dot -Tsvg > json.svg")

So that's three ways we can interpret a grammar. The only limit to others is your imagination or need. For instance, it would be easy to generate a regex from a grammar.

An interesting thing happened on the way to the show

I have to say, the code to generate the graphs is not easy to grok. It's pretty deep and there are several requirements that all have to be satisfied simultaneously. A real nightmare. Which is why it's been a week (or more) since the last post. However, as I was writing that code and testing it against the JSON grammar, I noticed something.


Can you tell which state graph is incorrect? By looking at the state graph, I found a bug in my JSON grammar which I subsequently corrected. Having different ways to look at things is really useful.

Custom parsers

One final point. Being able to easily specify a grammar and generate a parser from it allows you to create custom parsers. Say you want to use JSON to serialize some data structures from your code. Instead of using a generic JSON parser library, you can write a custom parser that only accepts an object value and limits the key strings to small set of strings. This custom parser is probably going to be much faster than a generic parser that parses the JSON into a hash-map which you then have extract the fields from. And, depending on the parser generated from the custom grammar, could give more useful error messages when it fails.

We'll wrap it up there, but there's some exciting things just ahead. I'm writing these posts to show how Toccata can be used for practical purposes and we're going delve into another format next time.