Jan 18, 2018 - Ebnf

Comments

Long time, no see

Well, it's been longer than I would have liked since the last post. As usual in this industry, things always take longer than you expect. I'll not bore you with the details. So let's get on to the cool stuff.

EBNF

I've been writing about parsing for the last couple of posts. This always involves building up a grammar of some sort from smaller grammars and then passing the full grammar to a function to produce a parser. The point I've been trying to make, and will drive home in this post, is that the grammar itself is just a data structure. Deriving a parser from it is just one way to interpret it. Here are a couple of others.

But first, a digression.

It's convenient to have a way to write out a grammar in a concise, readable way that clearly specifies what strings are valid. The EBNF is a very useful tool to have. Click on the link to read up on it.

One thing you can do with a grammar in Toccata is generate the EBNF for it. This can be done with this library. You can take any grammar and pass it to the produce-ebnf function and it will spit one out. Using this library, the EBNF for the JSON grammar we looked at last time is

whitespace = { ' ' | '\t' | '\r' | '\n' | '\f' };
escaped chars = '\\', '\"' | '\\' | '/' | 'b' | 'f' | 'n' | 'r' | 't';
string = whitespace, '\"', { escaped chars | not '\"' }, '\"';
digit = '0' - '9';
integer = whitespace, ['-'], digit, { digit };
object = whitespace, '{',
         [{ string, whitespace, ':', whitespace, value, whitespace, ',', whitespace },
          string, whitespace, ':', whitespace, value],
         whitespace, '}';
array = whitespace, '[', [{ value, whitespace, ',', whitespace }, value], whitespace, ']';
value = string |
        integer |
        object |
        array |
        whitespace, 'true' |
        whitespace, 'false' |
        whitespace, 'null';

I'd originally intended to explain how the EBNF generator works, but as I reviewed it, I realized I would have to write about 5 posts and I have other things I want to get to. Maybe down the road, I'll come back to that. But the important thing I want to communicate is that a grammar in Toccata is just a data structure and they can be used in many different ways. For instance, here is the EBNF for the Toccata language itself. (I just realized I should put this in a documentation directory.) This document is not the specification of the parser, that's here. But it's generated automatically from the specification, so it's always accurate as of the time it was generated. How many other languages do you know that have an EBNF you can examine to see if an expression is valid? Most of the time, whatever the parser accepts is what's considered valid and there are almost always some weird corner cases.

A picture is worth ...

I forget when I discovered it, but this article was written in 2007 and deals with parsing, regular expressions and Finite State Automata. I can not recommend it, and the three that follow it, highly enough. For present purposes, I want to draw attention to the diagrams scattered throughout. These show the different states a parser can go through as it reads each character of the input string. These diagrams can be very illuminating in a way the textual representation of the EBNF isn't. This is similar to the Syntax Diagrams on the JSON.org home page. Except the Syntax Diagrams and EBNF show the structure of the grammar. The state machine graphs show the states of the parser. Wouldn't it be nice to be able to generate these graphs automatically?.

Using that library, here's the graph for the JSON grammar we've been looking at.

link

This did require some updates to the json.toc grammar. How it works is the 'grammar-graph' library converts the grammar to a Graphviz DOT file. This file is then fed into the dot command to produce an SVG or PNG image. Slick!

The Toccata code is located in the 'examples' directory alongside the 'json.toc' file:

Parse some sample JSON strings
Generate EBNF
Generate a state graph
(To produce an SVG file, "run examples/json-graph.toc | dot -Tsvg > json.svg")

So that's three ways we can interpret a grammar. The only limit to others is your imagination or need. For instance, it would be easy to generate a regex from a grammar.

An interesting thing happened on the way to the show

I have to say, the code to generate the graphs is not easy to grok. It's pretty deep and there are several requirements that all have to be satisfied simultaneously. A real nightmare. Which is why it's been a week (or more) since the last post. However, as I was writing that code and testing it against the JSON grammar, I noticed something.

link

Can you tell which state graph is incorrect? By looking at the state graph, I found a bug in my JSON grammar which I subsequently corrected. Having different ways to look at things is really useful.

Custom parsers

One final point. Being able to easily specify a grammar and generate a parser from it allows you to create custom parsers. Say you want to use JSON to serialize some data structures from your code. Instead of using a generic JSON parser library, you can write a custom parser that only accepts an object value and limits the key strings to small set of strings. This custom parser is probably going to be much faster than a generic parser that parses the JSON into a hash-map which you then have extract the fields from. And, depending on the parser generated from the custom grammar, could give more useful error messages when it fails.

We'll wrap it up there, but there's some exciting things just ahead. I'm writing these posts to show how Toccata can be used for practical purposes and we're going delve into another format next time.

Jan 10, 2018 - Parsing Json

Comments

The JSON Spec

As promised, we're going to write a JSON parser today. With a couple of important caveats which I'll cover later. Before reading further, open this link in another tab. It's the official json.org site and has the JSON grammar specified as syntax diagrams.

And now, let's get the boiler plate out of the way.

(add-ns rd (git-dependency "https://github.com/Toccata-Lang/recursive-descent.git"
                           "recursive-descent.toc"
                           :sha "882b014"))
(add-ns grmr (git-dependency "https://github.com/Toccata-Lang/grammar.git"
                             "grammar.toc"
                             :sha "7690cd3"))
(add-ns fio (git-dependency "https://github.com/Toccata-Lang/file-io.git"
                            "file-io.toc"
                            :sha "e7a489b"))

Now for the actual grammar. I mean, how hard could it be to parse JSON?

The JSON grammar

Well actually, it's not that hard. We've already covered almost everything we need and with the addition of one more concept, we'll be golden. This post will mostly be a bunch of code with relatively little exposition.

When reading the JSON syntax diagrams, there's an assumption that is not explicitly stated; The use of whitespace to separate various pieces. So our first task is to account for that.

(def whitespace
  (grmr/none-or-more (grmr/any " " "\t" "\r" "\n" "\f")))

The first actual entity we want to describe are numbers. But since Toccata doesn't have support for floats yet, we're going to ignore them. A JSON integer can be either positive or negative, so we'll describe positive ones first, then account for negative ones.

(def integer-value (grmr/apply-fn (fn [negate digits]
                                    (let [magnitude (str-to-int (to-str digits))]
                                      (either (and negate
                                                   (maybe (* -1 magnitude)))
                                              magnitude)))
                                  (grmr/ignore whitespace)
                                  (grmr/optional "-")
                                  (grmr/one-or-more grmr/digit)))

Now let's look at character strings. We have to account for escaping special characters and that takes a lot of lines but mostly repetition.

(def escaped-doublequote
  (grmr/apply-fn (fn [& _] "\"")
                 "\""))

(def escaped-backslash
  (grmr/apply-fn (fn [& _] "\\")
                 "\\"))

(def escaped-slash
  (grmr/apply-fn (fn [& _] "/")
                 "/"))

(def escaped-backspace
  (grmr/apply-fn (fn [& _] "\b")
                 "b"))

(def escaped-formfeed
  (grmr/apply-fn (fn [& _] "\f")
                 "f"))

(def escaped-newline
  (grmr/apply-fn (fn [& _] "\n")
                 "n"))

(def escaped-return
  (grmr/apply-fn (fn [& _] "\r")
                 "r"))

(def escaped-tab
  (grmr/apply-fn (fn [& _] "\t")
                 "t"))

(def escaped-chars (grmr/all "\\" (grmr/any escaped-doublequote
                                            escaped-backslash
                                            escaped-slash
                                            escaped-backspace
                                            escaped-formfeed
                                            escaped-newline
                                            escaped-return
                                            escaped-tab)))

And now we can describe a JSON character string. The second caveat is that this does not handle escaped Unicode character sequences.

(def string-value (grmr/apply-fn identity
                                 (grmr/ignore whitespace)
                                 (grmr/apply-fn to-str
                                                (grmr/ignore "\"")
                                                (grmr/none-or-more (grmr/any escaped-chars
                                                                             (grmr/not-char "\"")))
                                                (grmr/ignore "\""))))

With the basics out of the way, we come to that new concept.

Both the 'object' and 'array' structures from the JSON spec can nest inside each other. So arrays can contain values which are arrays or objects and likewise for objects. But there's no way for a grammar definition to refer to itself by name because they are strictly values and must be fully specified before being assigned to a name. Officially, these data structures are recursive. Not only that, they are mutually recursive. If you read the spec for arrays, you see that they can contain any number of values. And a value can be a string, number, object, array, etc. Therefore, before describing the grammar for array, we have to describe value, but we can't describe value without array. What to do, what to do?

The solution is to 'forward declare' a placeholder for value and then replace it with the real grammar later. Here's how you do that.

(def value (grmr/recurse "value"))

The string passed to grmr/recurse must match the string we'll use later in defining value. It doesn't have to be the same as the symbol we're assigning this to, but that reduces confusion. If Toccata had a macro system (and it will eventually) this kind of boilerplate would be handled in some kind of defgrammar macro or something.

Now, let's describe the array grammar.

(def comma (grmr/all whitespace "," whitespace))

(def array
  (grmr/apply-fn flatten
                 (grmr/ignore whitespace)
                 (grmr/ignore "[")
                 (grmr/none-or-more
                  (grmr/apply-fn (fn [items last-item]
                                   (conj (flatten items) last-item))
                                 (grmr/none-or-more (grmr/all value
                                                              (grmr/ignore comma)))
                                 value))
                 (grmr/ignore whitespace)
                 (grmr/ignore "]")))

There's a few funky things to account for the peculuarities of the JSON array spec, but it works. I'd suggest putting print-errs in various places and parsing some JSON to gain a better understanding of what's going on.

The grammar for an object isn't much more complicated.

(def colon (grmr/all whitespace ":" whitespace))

(def key-value-pair (grmr/all string-value
                              (grmr/ignore colon)
                              value))

(def object
  (grmr/apply-fn (fn [kv-pairs]
                   (-> kv-pairs
                       flatten
                       (reduce {} (fn [m [k v]]
                                    (assoc m k v)))))
                 (grmr/ignore whitespace)
                 (grmr/ignore "{")
                 (grmr/none-or-more
                  (grmr/apply-fn (fn [items last-item]
                                   (conj (flatten items) last-item))
                                 (grmr/none-or-more (grmr/all key-value-pair
                                                              (grmr/ignore comma)))
                                 key-value-pair))
                 (grmr/ignore whitespace)
                 (grmr/ignore "}")))

And now, we need to fill in that placeholder for value.

(def value
  (grmr/recursive-rule "value"
                       (grmr/any string-value
                                 integer-value
                                 object
                                 array
                                 (grmr/apply-fn (fn [_ _] (maybe 'true))
                                                whitespace
                                                "true")
                                 (grmr/apply-fn (fn [_ _] nothing)
                                                whitespace
                                                "false")
                                 (grmr/apply-fn (fn [_ _] 'null)
                                                whitespace
                                                "null"))))

The first argument to grmr/recursive-rule must match the parameter passed to grmr/recurse above. In fact, they don't even have to be strings. Just two values that can be compared for equality. The second argument is the actual grammar that describes a value. In addition to what we've seen described above, there are 3 anonymous grammars for some JSON literal values.

And there you go, a grammar that describes JSON. Now create a parser from it.

(def parser (rd/parser value))

And you're good to go. I'll leave slurping in a file and parsing it for you to do on your own. Have fun!

EBNF

Syntax diagrams are not the only way to specify grammars. I'm planning on covering a different way next time. Along with something that I hope will be a mind blowing revelation. Or at least a little surprising.

Jan 6, 2018 - Config Files

Comments

We're rolling now

I don't know about you, but I'm really pleased at how this blog series is developing. Each post seems to be building on the last by adding a couple of new pieces. I would love to say I intended it that way, but I can't.

In this post, we're going to begin looking at file I/O in addition to parsing.

More than command lines

So far, we've seen how to parse the command line arguments passed to our main function. This could be extended to create a full blown arguments parsing library with long and short options, etc. But at some point, we're going to need more than what can be conveniently passed in on the command line. We're going to want to keep a file full of configuration information that will be read at each invocation and its contents used to inform our program's execution.

This turns out to be just a tiny step beyond what we've already seen. I'm of the opinion that config files should be relatively simple. If you need to use JSON or XML for nesting complex data structures, you're getting close to needing a full-blown DSL. And what you're writing is more of an interpreter than just a program with a config file. So for our purposes, we're going to limit the lines of a config file to

  • empty lines (only containg spaces and tabs)
  • comment lines beginning with '#'
  • value lines that have a name (alphabetic characters and '-') and a value (integers or strings delimited by '"')

and that's it. So let's do this.

The code

First, import the needed libraries

(add-ns rd (git-dependency "https://github.com/Toccata-Lang/recursive-descent.git"
                           "recursive-descent.toc"
                           :sha "882b014"))
(add-ns grmr (git-dependency "https://github.com/Toccata-Lang/grammar.git"
                             "grammar.toc"
                             :sha "7690cd3"))
(add-ns fio (git-dependency "https://github.com/Toccata-Lang/file-io.git"
                            "file-io.toc"
                            :sha "e7a489b"))

We've added the file I/O library to our list of imports. This lib contains some functions to do basic reads and writes of files.

Now let's start describing our grammar for the config file lines. Here's an empty line.

(def whitespace (grmr/one-or-more (grmr/any " "
                                            "\t")))

(def newline "\n")

(def empty-line (grmr/all (grmr/optional whitespace)
                          newline))

We start by declaring rules for white space and a new line character. Then define the empty-line rule.

Now, we declare a comment

(def not-newline (grmr/not-char "\n"))

(def comment (grmr/all (grmr/optional whitespace)
                       "#"
                       (grmr/none-or-more not-newline)
                       newline))

The grmr/not-char combinator does what it says. It matches any charactor other than the one given. Now, let's bring over name and integer from the last post.

(def name (map (grmr/one-or-more (grmr/any grmr/alpha
                                           "-"))
               to-str))

(def integer-value (map (grmr/one-or-more grmr/digit)
                        (fn [digits]
                          (str-to-int (to-str digits)))))

A new requirement is reading in strings delimited by double quotes. But we have all the tools to do this easily.

(def string-value (grmr/apply-fn to-str
                                 (grmr/ignore "\"")
                                 (grmr/none-or-more (grmr/not-char "\""))
                                 (grmr/ignore "\"")))

And now, the rubber meets the road. We need to combine these pieces to parse a config file line and create a data structure of the name string and config value. A moment's thought surfaces some requirements.

  • We don't know which names will be defined in the file
  • We don't know what order they will be defined in
  • We want to end up with a HashMap of all names to values in the file

There are multiple ways to do this. My preferred way is to put each name/value pair into a HashMap and then just compose all the maps after parsing all the lines.

(def config-line (grmr/apply-fn (fn [param val]
                                  {param val})
                                (grmr/ignore (grmr/optional whitespace))
                                name
                                (grmr/ignore whitespace)
                                (grmr/any integer-value
                                          string-value)
                                (grmr/ignore (grmr/optional whitespace))
                                (grmr/ignore newline)))

(side note: I left that anonymous function in to show the param and value. But replacing it with hash-map works just fine.)

The final piece is to pull it all together to declare a grammar that will parse an entire config file

(defn ignore [g]
  (map g (fn [_] {})))

(def config-file (map (grmr/none-or-more (grmr/any (ignore comment)
                                                   config-line
                                                   (ignore empty-line)))
                      (fn [config-lines]
                        (comp* {} config-lines))))

(def parser (rd/parser config-file))

Each line gets converted to a HashMap. But we need to ignore the comment and empty lines. So we define a 'higher order' grammar rule. (Really, it's just a function that takes a grammar rule and returns a modified version.) In this case, ignore takes a grammar a rule and returns a rule that always returns an empty HashMap upon a successful match. This is different from and does not replace grmr/ignore from the grammar library.

And then, it's straightforward to define our config file grammar. It's just none (since the file may be empty) or more lines. Each line may be either a comment or an empty line, which produces an empty HashMap. Or a configuration name/value pair.

This list of HashMaps is passed to the anonymous function as the parameter config-lines. Then we use the parametric form of comp, the comp* function to squash all the HashMap into a single HashMap. This also means that if a config name is defined multiple times in a file, only the last one will appear in the final HashMap.

All that's left is to read in the config file and parse it.

(main [_]
      (for [config-map (parser (fio/slurp "config.txt"))]
        (map (seq config-map) (fn [[name value]]
                                (println (str name ":") value)))))

I'm going to leave the explanation of the main function out. You should be able to read this if you've read the previous posts. I will say that fio/slurp just pulls the entire contents of a file into a string in one shot.

So, if you have a config file that looks like this

# this is a comment
some-config    19
thread-count     100
str-val         "some string"
url             "http://another.com"

#              another comment

When you run this program, you should get

url: http://another.com
str-val: some string
some-config: 19
thread-count: 100

Though the order of the lines may be different.

Better than a library

In other languages, there might be a library or package that you would import to handle config files. Open source is great, but when you pull in a package like that, you take on the responsibility for keeping it integrated with your application and that comes at a cost. What I've tried to show here is how easy it is to have custom software to do exactly what you need.

For example, it would be trivial to only allow certain names to be in the config file. Or add different kinds of values. I think this is much superior to importing a bunch of libraries to do relatively simple tasks.

But ...

What if you really do want to parse JSON?

That's up next