The JSON Spec
As promised, we're going to write a JSON parser today. With a couple of important caveats which I'll cover later. Before reading further, open this link in another tab. It's the official json.org site and has the JSON grammar specified as syntax diagrams.
And now, let's get the boiler plate out of the way.
Now for the actual grammar. I mean, how hard could it be to parse JSON?
The JSON grammar
Well actually, it's not that hard. We've already covered almost everything we need and with the addition of one more concept, we'll be golden. This post will mostly be a bunch of code with relatively little exposition.
When reading the JSON syntax diagrams, there's an assumption that is not explicitly stated; The use of whitespace to separate various pieces. So our first task is to account for that.
The first actual entity we want to describe are numbers. But since Toccata doesn't have support for floats yet, we're going to ignore them. A JSON integer can be either positive or negative, so we'll describe positive ones first, then account for negative ones.
Now let's look at character strings. We have to account for escaping special characters and that takes a lot of lines but mostly repetition.
And now we can describe a JSON character string. The second caveat is that this does not handle escaped Unicode character sequences.
With the basics out of the way, we come to that new concept.
Both the 'object' and 'array' structures from the JSON spec can nest inside each other. So arrays can contain values which are arrays or objects and likewise for objects. But there's no way for a grammar definition to refer to itself by name because they are strictly values and must be fully specified before being assigned to a name. Officially, these data structures are
recursive. Not only that, they are mutually recursive. If you read the spec for arrays, you see that they can contain any number of
values. And a value can be a string, number, object, array, etc. Therefore, before describing the grammar for array, we have to describe value, but we can't describe value without array. What to do, what to do?
The solution is to 'forward declare' a placeholder for value and then replace it with the real grammar later. Here's how you do that.
The string passed to
grmr/recurse must match the string we'll use later in defining
value. It doesn't have to be the same as the symbol we're assigning this to, but that reduces confusion. If Toccata had a macro system (and it will eventually) this kind of boilerplate would be handled in some kind of
defgrammar macro or something.
Now, let's describe the array grammar.
There's a few funky things to account for the peculuarities of the JSON array spec, but it works. I'd suggest putting
print-errs in various places and parsing some JSON to gain a better understanding of what's going on.
The grammar for an object isn't much more complicated.
And now, we need to fill in that placeholder for
The first argument to
grmr/recursive-rule must match the parameter passed to
grmr/recurse above. In fact, they don't even have to be strings. Just two values that can be compared for equality. The second argument is the actual grammar that describes a value. In addition to what we've seen described above, there are 3 anonymous grammars for some JSON literal values.
And there you go, a grammar that describes JSON. Now create a parser from it.
And you're good to go. I'll leave slurping in a file and parsing it for you to do on your own. Have fun!
Syntax diagrams are not the only way to specify grammars. I'm planning on covering a different way next time. Along with something that I hope will be a mind blowing revelation. Or at least a little surprising.