Word Up
Welcome to another exciting episode of "Parsing With Toccata", kids. In today's installment, we're going to parse a sentence...
Given the sentence above, we could write a grammar to parse it into a list of words using string literals. But that's not very interesting or useful. Instead, we're going to use some more combinators from the grammar repo to do the job. We'll do this in small steps and I'll leave out the boiler plate. This is going to introduce a whole lot of new things all at once, but I think you'll be able to follow it.
First up, a 'word' is a sequence of (lower-case only for now) letters seperated by a single space. Except that a word may be at the front or the end of the string, so we can't say that a word will always have a space either before or after it. Therefore, it's going to be optional. And if a space is read by the parser, we want to ignore it. So the first piece of our grammar is
This expression describes a grammar that optionally matches a single space character, which it then ignores. This grammar is assigned to the symbol space
. The grmr/optional
and grmr/ignore
functions are unary combinators that each take a single grammar as an argument and return a new grammar which modifies the given grammar in some fashion.
I'm going to bring out the next expression in pieces.
This expression is one of those new 'primitive' grammars. It specifies a grammar that will match any character in the range of "a" to "z", inclusive.
But a word is a sequence of such characters, so we need to use another combinator to say "match one or more of these characters in a row".
It does what is says on the tin. But there's a problem. When the grammar produced by the grmr/one-or-more
combinator successfully matches a sequence of characters, it produces a list of characters, like this. (I'm pretending I have a REPL. My TODO list is way too long. sigh)
So on success, it produces a vector of strings. (Results simulated. Actual printed output may vary.)
What we need is a way to concatenate that list of strings into one string. Just such a function is in the core under the Seqable
protocol; to-str
. This protocol function can be implemented for any data type that has a notion of 'sequence' to efficiently produce a string from the contents. Conveniently, it is defined for Vector's and List's.
But a grammar doesn't implement to-str
because it's not a sequence. What we really want is to apply that function to what the grammar matches. One way to look at grammars is that they 'contain' a result of a successful match (potentially), sort of like how Maybe values may contain another value or not. And the Container
protocol has a nifty function called map
for just this occasion which is implemented for grammars. Sooo ...
The call to map
the grmr/one-or-more
expression with the function to-str
creates a new grammar. Hopefully, you can see what the result does. (Another simulated REPL command)
Now, we need a grammar that matches one or more words in the sentence. I'm just gonna drop the whole thing in here and explain the two new pieces.
First of all, grmr/all
create a new grammar from a number of other grammars, in this case word
and space
. On second thought, this deserves it's own REPL simulation
So what the grmr/all
combinator does is create a new grammar that on a successful match of all of the sub-grammars, returns a list of the result of each of the successful matches. In this case, the space
grammar ignores its match, so it doesn't show up even though there's a space in the original string. So if we match one or more of this grammar, we get a list of lists. (Final simulated REPL.)
Which is not what we want. So we create a new grammar by mapping the flatten
function over the result to produce the words
grammar.
Here's the final program in all its copy-and-past glory. Hopefully, it's an easy read now.
Put that in a file and run it.
(Now I'm taking a nap.)