The JSON Spec
As promised, we're going to write a JSON parser today. With a couple of important caveats which I'll cover later. Before reading further, open this link in another tab. It's the official json.org site and has the JSON grammar specified as syntax diagrams.
And now, let's get the boiler plate out of the way.
(add-ns rd (git-dependency "https://github.com/Toccata-Lang/recursive-descent.git"
"recursive-descent.toc"
:sha "882b014"))
(add-ns grmr (git-dependency "https://github.com/Toccata-Lang/grammar.git"
"grammar.toc"
:sha "7690cd3"))
(add-ns fio (git-dependency "https://github.com/Toccata-Lang/file-io.git"
"file-io.toc"
:sha "e7a489b"))
Now for the actual grammar. I mean, how hard could it be to parse JSON?
The JSON grammar
Well actually, it's not that hard. We've already covered almost everything we need and with the addition of one more concept, we'll be golden. This post will mostly be a bunch of code with relatively little exposition.
When reading the JSON syntax diagrams, there's an assumption that is not explicitly stated; The use of whitespace to separate various pieces. So our first task is to account for that.
(def whitespace
(grmr/none-or-more (grmr/any " " "\t" "\r" "\n" "\f")))
The first actual entity we want to describe are numbers. But since Toccata doesn't have support for floats yet, we're going to ignore them. A JSON integer can be either positive or negative, so we'll describe positive ones first, then account for negative ones.
(def integer-value (grmr/apply-fn (fn [negate digits]
(let [magnitude (str-to-int (to-str digits))]
(either (and negate
(maybe (* -1 magnitude)))
magnitude)))
(grmr/ignore whitespace)
(grmr/optional "-")
(grmr/one-or-more grmr/digit)))
Now let's look at character strings. We have to account for escaping special characters and that takes a lot of lines but mostly repetition.
(def escaped-doublequote
(grmr/apply-fn (fn [& _] "\"")
"\""))
(def escaped-backslash
(grmr/apply-fn (fn [& _] "\\")
"\\"))
(def escaped-slash
(grmr/apply-fn (fn [& _] "/")
"/"))
(def escaped-backspace
(grmr/apply-fn (fn [& _] "\b")
"b"))
(def escaped-formfeed
(grmr/apply-fn (fn [& _] "\f")
"f"))
(def escaped-newline
(grmr/apply-fn (fn [& _] "\n")
"n"))
(def escaped-return
(grmr/apply-fn (fn [& _] "\r")
"r"))
(def escaped-tab
(grmr/apply-fn (fn [& _] "\t")
"t"))
(def escaped-chars (grmr/all "\\" (grmr/any escaped-doublequote
escaped-backslash
escaped-slash
escaped-backspace
escaped-formfeed
escaped-newline
escaped-return
escaped-tab)))
And now we can describe a JSON character string. The second caveat is that this does not handle escaped Unicode character sequences.
(def string-value (grmr/apply-fn identity
(grmr/ignore whitespace)
(grmr/apply-fn to-str
(grmr/ignore "\"")
(grmr/none-or-more (grmr/any escaped-chars
(grmr/not-char "\"")))
(grmr/ignore "\""))))
With the basics out of the way, we come to that new concept.
Both the 'object' and 'array' structures from the JSON spec can nest inside each other. So arrays can contain values which are arrays or objects and likewise for objects. But there's no way for a grammar definition to refer to itself by name because they are strictly values and must be fully specified before being assigned to a name. Officially, these data structures are recursive
. Not only that, they are mutually recursive. If you read the spec for arrays, you see that they can contain any number of value
s. And a value can be a string, number, object, array, etc. Therefore, before describing the grammar for array, we have to describe value, but we can't describe value without array. What to do, what to do?
The solution is to 'forward declare' a placeholder for value and then replace it with the real grammar later. Here's how you do that.
(def value (grmr/recurse "value"))
The string passed to grmr/recurse
must match the string we'll use later in defining value
. It doesn't have to be the same as the symbol we're assigning this to, but that reduces confusion. If Toccata had a macro system (and it will eventually) this kind of boilerplate would be handled in some kind of defgrammar
macro or something.
Now, let's describe the array grammar.
(def comma (grmr/all whitespace "," whitespace))
(def array
(grmr/apply-fn flatten
(grmr/ignore whitespace)
(grmr/ignore "[")
(grmr/none-or-more
(grmr/apply-fn (fn [items last-item]
(conj (flatten items) last-item))
(grmr/none-or-more (grmr/all value
(grmr/ignore comma)))
value))
(grmr/ignore whitespace)
(grmr/ignore "]")))
There's a few funky things to account for the peculuarities of the JSON array spec, but it works. I'd suggest putting print-err
s in various places and parsing some JSON to gain a better understanding of what's going on.
The grammar for an object isn't much more complicated.
(def colon (grmr/all whitespace ":" whitespace))
(def key-value-pair (grmr/all string-value
(grmr/ignore colon)
value))
(def object
(grmr/apply-fn (fn [kv-pairs]
(-> kv-pairs
flatten
(reduce {} (fn [m [k v]]
(assoc m k v)))))
(grmr/ignore whitespace)
(grmr/ignore "{")
(grmr/none-or-more
(grmr/apply-fn (fn [items last-item]
(conj (flatten items) last-item))
(grmr/none-or-more (grmr/all key-value-pair
(grmr/ignore comma)))
key-value-pair))
(grmr/ignore whitespace)
(grmr/ignore "}")))
And now, we need to fill in that placeholder for value
.
(def value
(grmr/recursive-rule "value"
(grmr/any string-value
integer-value
object
array
(grmr/apply-fn (fn [_ _] (maybe 'true))
whitespace
"true")
(grmr/apply-fn (fn [_ _] nothing)
whitespace
"false")
(grmr/apply-fn (fn [_ _] 'null)
whitespace
"null"))))
The first argument to grmr/recursive-rule
must match the parameter passed to grmr/recurse
above. In fact, they don't even have to be strings. Just two values that can be compared for equality. The second argument is the actual grammar that describes a value. In addition to what we've seen described above, there are 3 anonymous grammars for some JSON literal values.
And there you go, a grammar that describes JSON. Now create a parser from it.
(def parser (rd/parser value))
And you're good to go. I'll leave slurping in a file and parsing it for you to do on your own. Have fun!
EBNF
Syntax diagrams are not the only way to specify grammars. I'm planning on covering a different way next time. Along with something that I hope will be a mind blowing revelation. Or at least a little surprising.