Word Up

Welcome to another exciting episode of "Parsing With Toccata", kids. In today's installment, we're going to parse a sentence...

"i am not a bot"

Given the sentence above, we could write a grammar to parse it into a list of words using string literals. But that's not very interesting or useful. Instead, we're going to use some more combinators from the grammar repo to do the job. We'll do this in small steps and I'll leave out the boiler plate. This is going to introduce a whole lot of new things all at once, but I think you'll be able to follow it.

First up, a 'word' is a sequence of (lower-case only for now) letters seperated by a single space. Except that a word may be at the front or the end of the string, so we can't say that a word will always have a space either before or after it. Therefore, it's going to be optional. And if a space is read by the parser, we want to ignore it. So the first piece of our grammar is

(def space (grmr/ignore (grmr/optional " ")))

This expression describes a grammar that optionally matches a single space character, which it then ignores. This grammar is assigned to the symbol space. The grmr/optional and grmr/ignore functions are unary combinators that each take a single grammar as an argument and return a new grammar which modifies the given grammar in some fashion.

I'm going to bring out the next expression in pieces.

(grmr/char-range "a" "z")

This expression is one of those new 'primitive' grammars. It specifies a grammar that will match any character in the range of "a" to "z", inclusive.

But a word is a sequence of such characters, so we need to use another combinator to say "match one or more of these characters in a row".

(grmr/one-or-more (grmr/char-range "a" "z"))

It does what is says on the tin. But there's a problem. When the grammar produced by the grmr/one-or-more combinator successfully matches a sequence of characters, it produces a list of characters, like this. (I'm pretending I have a REPL. My TODO list is way too long. sigh)

((rd/parser (grmr/one-or-more (grmr/char-range "a" "z"))) "not")
=> <maybe ["n", "o", "t"]>

So on success, it produces a vector of strings. (Results simulated. Actual printed output may vary.)

What we need is a way to concatenate that list of strings into one string. Just such a function is in the core under the Seqable protocol; to-str. This protocol function can be implemented for any data type that has a notion of 'sequence' to efficiently produce a string from the contents. Conveniently, it is defined for Vector's and List's.

But a grammar doesn't implement to-str because it's not a sequence. What we really want is to apply that function to what the grammar matches. One way to look at grammars is that they 'contain' a result of a successful match (potentially), sort of like how Maybe values may contain another value or not. And the Container protocol has a nifty function called map for just this occasion which is implemented for grammars. Sooo ...

(def word (map (grmr/one-or-more (grmr/char-range "a" "z"))
               to-str))

The call to map the grmr/one-or-more expression with the function to-str creates a new grammar. Hopefully, you can see what the result does. (Another simulated REPL command)

((rd/parser word) "not")
=> <maybe "not">

Now, we need a grammar that matches one or more words in the sentence. I'm just gonna drop the whole thing in here and explain the two new pieces.

(def words (map (grmr/one-or-more (grmr/all word space))
                flatten))

First of all, grmr/all create a new grammar from a number of other grammars, in this case word and space. On second thought, this deserves it's own REPL simulation

((rd/parser (grmr/all word space)) "not ")
=> <maybe ("not")>

So what the grmr/all combinator does is create a new grammar that on a successful match of all of the sub-grammars, returns a list of the result of each of the successful matches. In this case, the space grammar ignores its match, so it doesn't show up even though there's a space in the original string. So if we match one or more of this grammar, we get a list of lists. (Final simulated REPL.)

((rd/parser (grmr/one-or-more (grmr/all word space))) "not a bot")
=> <maybe [("not"), ("a"), ("bot")]>

Which is not what we want. So we create a new grammar by mapping the flatten function over the result to produce the words grammar.

Here's the final program in all its copy-and-past glory. Hopefully, it's an easy read now.

(add-ns rd (git-dependency "https://github.com/Toccata-Lang/recursive-descent.git"
                           "recursive-descent.toc"
                           :sha "882b014"))
(add-ns grmr (git-dependency "https://github.com/Toccata-Lang/grammar.git"
                             "grammar.toc"
                             :sha "7690cd3"))

(def space (grmr/ignore (grmr/optional " ")))

(def word (map (grmr/one-or-more (grmr/char-range "a" "z"))
               to-str))

(def words (map (grmr/one-or-more (grmr/all word space))
                flatten))

(def parser (rd/parser words))

(main [_]
      (println "=>" (parser "i am not a bot"))
      (println "=>" (parser "i am NOT a bot")))

Put that in a file and run it.

(Now I'm taking a nap.)