The JSON Spec

As promised, we're going to write a JSON parser today. With a couple of important caveats which I'll cover later. Before reading further, open this link in another tab. It's the official json.org site and has the JSON grammar specified as syntax diagrams.

And now, let's get the boiler plate out of the way.

(add-ns rd (git-dependency "https://github.com/Toccata-Lang/recursive-descent.git"
                           "recursive-descent.toc"
                           :sha "882b014"))
(add-ns grmr (git-dependency "https://github.com/Toccata-Lang/grammar.git"
                             "grammar.toc"
                             :sha "7690cd3"))
(add-ns fio (git-dependency "https://github.com/Toccata-Lang/file-io.git"
                            "file-io.toc"
                            :sha "e7a489b"))

Now for the actual grammar. I mean, how hard could it be to parse JSON?

The JSON grammar

Well actually, it's not that hard. We've already covered almost everything we need and with the addition of one more concept, we'll be golden. This post will mostly be a bunch of code with relatively little exposition.

When reading the JSON syntax diagrams, there's an assumption that is not explicitly stated; The use of whitespace to separate various pieces. So our first task is to account for that.

(def whitespace
  (grmr/none-or-more (grmr/any " " "\t" "\r" "\n" "\f")))

The first actual entity we want to describe are numbers. But since Toccata doesn't have support for floats yet, we're going to ignore them. A JSON integer can be either positive or negative, so we'll describe positive ones first, then account for negative ones.

(def integer-value (grmr/apply-fn (fn [negate digits]
                                    (let [magnitude (str-to-int (to-str digits))]
                                      (either (and negate
                                                   (maybe (* -1 magnitude)))
                                              magnitude)))
                                  (grmr/ignore whitespace)
                                  (grmr/optional "-")
                                  (grmr/one-or-more grmr/digit)))

Now let's look at character strings. We have to account for escaping special characters and that takes a lot of lines but mostly repetition.

(def escaped-doublequote
  (grmr/apply-fn (fn [& _] "\"")
                 "\""))

(def escaped-backslash
  (grmr/apply-fn (fn [& _] "\\")
                 "\\"))

(def escaped-slash
  (grmr/apply-fn (fn [& _] "/")
                 "/"))

(def escaped-backspace
  (grmr/apply-fn (fn [& _] "\b")
                 "b"))

(def escaped-formfeed
  (grmr/apply-fn (fn [& _] "\f")
                 "f"))

(def escaped-newline
  (grmr/apply-fn (fn [& _] "\n")
                 "n"))

(def escaped-return
  (grmr/apply-fn (fn [& _] "\r")
                 "r"))

(def escaped-tab
  (grmr/apply-fn (fn [& _] "\t")
                 "t"))

(def escaped-chars (grmr/all "\\" (grmr/any escaped-doublequote
                                            escaped-backslash
                                            escaped-slash
                                            escaped-backspace
                                            escaped-formfeed
                                            escaped-newline
                                            escaped-return
                                            escaped-tab)))

And now we can describe a JSON character string. The second caveat is that this does not handle escaped Unicode character sequences.

(def string-value (grmr/apply-fn identity
                                 (grmr/ignore whitespace)
                                 (grmr/apply-fn to-str
                                                (grmr/ignore "\"")
                                                (grmr/none-or-more (grmr/any escaped-chars
                                                                             (grmr/not-char "\"")))
                                                (grmr/ignore "\""))))

With the basics out of the way, we come to that new concept.

Both the 'object' and 'array' structures from the JSON spec can nest inside each other. So arrays can contain values which are arrays or objects and likewise for objects. But there's no way for a grammar definition to refer to itself by name because they are strictly values and must be fully specified before being assigned to a name. Officially, these data structures are recursive. Not only that, they are mutually recursive. If you read the spec for arrays, you see that they can contain any number of values. And a value can be a string, number, object, array, etc. Therefore, before describing the grammar for array, we have to describe value, but we can't describe value without array. What to do, what to do?

The solution is to 'forward declare' a placeholder for value and then replace it with the real grammar later. Here's how you do that.

(def value (grmr/recurse "value"))

The string passed to grmr/recurse must match the string we'll use later in defining value. It doesn't have to be the same as the symbol we're assigning this to, but that reduces confusion. If Toccata had a macro system (and it will eventually) this kind of boilerplate would be handled in some kind of defgrammar macro or something.

Now, let's describe the array grammar.

(def comma (grmr/all whitespace "," whitespace))

(def array
  (grmr/apply-fn flatten
                 (grmr/ignore whitespace)
                 (grmr/ignore "[")
                 (grmr/none-or-more
                  (grmr/apply-fn (fn [items last-item]
                                   (conj (flatten items) last-item))
                                 (grmr/none-or-more (grmr/all value
                                                              (grmr/ignore comma)))
                                 value))
                 (grmr/ignore whitespace)
                 (grmr/ignore "]")))

There's a few funky things to account for the peculuarities of the JSON array spec, but it works. I'd suggest putting print-errs in various places and parsing some JSON to gain a better understanding of what's going on.

The grammar for an object isn't much more complicated.

(def colon (grmr/all whitespace ":" whitespace))

(def key-value-pair (grmr/all string-value
                              (grmr/ignore colon)
                              value))

(def object
  (grmr/apply-fn (fn [kv-pairs]
                   (-> kv-pairs
                       flatten
                       (reduce {} (fn [m [k v]]
                                    (assoc m k v)))))
                 (grmr/ignore whitespace)
                 (grmr/ignore "{")
                 (grmr/none-or-more
                  (grmr/apply-fn (fn [items last-item]
                                   (conj (flatten items) last-item))
                                 (grmr/none-or-more (grmr/all key-value-pair
                                                              (grmr/ignore comma)))
                                 key-value-pair))
                 (grmr/ignore whitespace)
                 (grmr/ignore "}")))

And now, we need to fill in that placeholder for value.

(def value
  (grmr/recursive-rule "value"
                       (grmr/any string-value
                                 integer-value
                                 object
                                 array
                                 (grmr/apply-fn (fn [_ _] (maybe 'true))
                                                whitespace
                                                "true")
                                 (grmr/apply-fn (fn [_ _] nothing)
                                                whitespace
                                                "false")
                                 (grmr/apply-fn (fn [_ _] 'null)
                                                whitespace
                                                "null"))))

The first argument to grmr/recursive-rule must match the parameter passed to grmr/recurse above. In fact, they don't even have to be strings. Just two values that can be compared for equality. The second argument is the actual grammar that describes a value. In addition to what we've seen described above, there are 3 anonymous grammars for some JSON literal values.

And there you go, a grammar that describes JSON. Now create a parser from it.

(def parser (rd/parser value))

And you're good to go. I'll leave slurping in a file and parsing it for you to do on your own. Have fun!

EBNF

Syntax diagrams are not the only way to specify grammars. I'm planning on covering a different way next time. Along with something that I hope will be a mind blowing revelation. Or at least a little surprising.