Jan 2, 2018 - Parsing Command Lines

Comments

For Arguments Sake

In a previous post, I very briefly dipped into passing arguments to a Toccata program on the command line. But we need to be able to do much more than just grab a string from the command arguments. We need to convert said arguments (which are strings) to actual data we can use to customize how our programs run. Many languages have custom libraries to handle command line arguments, but I prefer to have a single 'canonical' way of handling parsing in Toccata. So I've tried to make the parsing library I've been discussing easy enough to use for the command line arguments.

The first thing to know is that the single argument passed to the main function is actually a list of strings. The first string is the command that was run (regardless of what program actually is executing) and the rest are the command line arguments. This has the effect of removing any whitespace separating the arguments on the command line.

This leads to the second thing. The parser generated by rd/parser will happily parse a list of strings just as easily as a single string. So parsing command line arguments is just a matter of specifying a grammar for the arguments, generating a parser from it, and then passing the arguments list to said parser.

But there's a wrinkle. Depending on how you want to parse your arguments, you may need some way of separating one from the next. This is what spaces are used for. But they've all been removed by the shell, so they will need to be put back in. I prefer to have spaces in there by default, so that's what I'll show below.

Nuts and Bolts

We (or at least I) need a program that will take an integer and a name and print a greeting the desired number of times. Here's about the simplest way I could think of to do this in about 2 minutes of work.

(defn print-msg [n msg]
  (either (= 0 n)
          (do
            (println msg)
            (print-msg (dec n) msg))))

(main [args]
      (let [[_ n name] args]
        (print-msg (str-to-int n)
                   (str "Hello, " name))))

The function print-msg is a classic example of how to do explicit loops in Toccata (as opposed to implicit loops using map, reduce, etc) using recursion and an either expression.

In the main function is where things are brittle. args is the list of command line arguments. For this to work, we assume a number of things. That it will have at least 3 elements, that the second element is a string containing only decimal integer digits, etc.

This does exactly what we wanted and is perfectly fine for a one-off task or something that I alone would use. It works if you give it only the correct inputs, but stray from that happy path and things blow up. Often with incomprehensible error messages. So let's make this more resiliant using the parsing libraries we've already seen in action.

Parsing the command line

First, the boilerplate.

(add-ns rd (git-dependency "https://github.com/Toccata-Lang/recursive-descent.git"
                           "recursive-descent.toc"
                           :sha "882b014"))
(add-ns grmr (git-dependency "https://github.com/Toccata-Lang/grammar.git"
                             "grammar.toc"
                             :sha "7690cd3"))

And the function that actually does the work, again

(defn print-msg [n msg]
  (either (= 0 n)
          (do
            (println msg)
            (print-msg (dec n) msg))))

Like most software, the actual functionality is quite small, it's all the supporting code that takes most of the time and effort. First up is a grammar rule that parses a string of digits and produces a positive integer.

(def integer (map (grmr/one-or-more grmr/digit)
                  (fn [digits]
                    (str-to-int (to-str digits)))))

Hopefully this is understandable at this point. If one or more digits are parsed from the input, the list of digit characters is converted to a single string and that string is then passed to str-to-int, producing the integer from the command line parameter. Anything else will fail.

Next, we need to parse the name parameter.

(def name (map (grmr/one-or-more grmr/alpha)
               to-str))

This restricts the name to only be alpha characters. (My apologies to anyone with apostrophes or other non-alpha characters in their name.) The list of characters parsed are then converted to a string by to-str.

And now, putting the grammar rules together into the full grammar.

(def parse-and-print (grmr/apply-fn (fn [n name]
                                      (print-msg n (str "Hello, " name)))
                                    integer
                                    (grmr/ignore " ")
                                    name))

This introduces a new concept, gmrm/apply-fn. This grammar combinator takes a function and a number of grammars. In this case, there are 3 grammars; integer, an unnamed one for a single space, and name. These 3 grammars are eached applied to the input, one after another. If all 3 successfully parse their piece of the input, the results of each of them are passed to the anonymous function which is the first value passed to grmr/apply-fn. Except the grammar that parses a single space is created with the grmr/ignore combinator, so it's result is discarded, which is why the anonomous function only takes 2 parameters and not 3.

Of those 2 parameters, n is the result of the integer grammar rule and name is the result of the name grammar rule. The anonomous function then builds the greeting string and calls print-msg.

But parse-and-print isn't a function that can be called. It's just a grammar, a data structure that describes the command line arguments and what to do with them. We need to create a parser from it as we've seen before.

(def parser (rd/parser parse-and-print))

And now parser is a function that will take a command line, parse the arguments and call print-msg if all arguments are of the correct form. Let's wrap this up with the main entry point.

(main [cmd-args]
      (either (-> cmd-args
                  rest
                  (interpose " ")
                  parser)
              (print-err "Invalid arguments.")))

Quite a few things happen in these few lines of code. First off, we take the rest of cmd-args to get only the command line arguments. Then we put a single space between them to match the grammar we declared above. This list of strings is then passed as input to the parser. If everything parses correctly and the greetings are printed, the value returned is a Maybe value containing the results of the parser (which we're ignoring for now). But if the parsing fails, the either expression falls through to the print-err expression to print a simple error message.

And that's it. This program is in the examples directory here.

You can run it using the run script in the project root.

bash-3.2$ ./run examples/number-option.toc 4

*** Invalid arguments.

The run script puts the executable into a file name m. The name doesn't mean anything, I just picked it at random. You can run the executable directly once it's been compiled.

bash-3.2$ ./m 3 Pop
Hello, Pop
Hello, Pop
Hello, Pop

Dec 28, 2017 - Parsing Words

Comments

Word Up

Welcome to another exciting episode of "Parsing With Toccata", kids. In today's installment, we're going to parse a sentence...

"i am not a bot"

Given the sentence above, we could write a grammar to parse it into a list of words using string literals. But that's not very interesting or useful. Instead, we're going to use some more combinators from the grammar repo to do the job. We'll do this in small steps and I'll leave out the boiler plate. This is going to introduce a whole lot of new things all at once, but I think you'll be able to follow it.

First up, a 'word' is a sequence of (lower-case only for now) letters seperated by a single space. Except that a word may be at the front or the end of the string, so we can't say that a word will always have a space either before or after it. Therefore, it's going to be optional. And if a space is read by the parser, we want to ignore it. So the first piece of our grammar is

(def space (grmr/ignore (grmr/optional " ")))

This expression describes a grammar that optionally matches a single space character, which it then ignores. This grammar is assigned to the symbol space. The grmr/optional and grmr/ignore functions are unary combinators that each take a single grammar as an argument and return a new grammar which modifies the given grammar in some fashion.

I'm going to bring out the next expression in pieces.

(grmr/char-range "a" "z")

This expression is one of those new 'primitive' grammars. It specifies a grammar that will match any character in the range of "a" to "z", inclusive.

But a word is a sequence of such characters, so we need to use another combinator to say "match one or more of these characters in a row".

(grmr/one-or-more (grmr/char-range "a" "z"))

It does what is says on the tin. But there's a problem. When the grammar produced by the grmr/one-or-more combinator successfully matches a sequence of characters, it produces a list of characters, like this. (I'm pretending I have a REPL. My TODO list is way too long. sigh)

((rd/parser (grmr/one-or-more (grmr/char-range "a" "z"))) "not")
=> <maybe ["n", "o", "t"]>

So on success, it produces a vector of strings. (Results simulated. Actual printed output may vary.)

What we need is a way to concatenate that list of strings into one string. Just such a function is in the core under the Seqable protocol; to-str. This protocol function can be implemented for any data type that has a notion of 'sequence' to efficiently produce a string from the contents. Conveniently, it is defined for Vector's and List's.

But a grammar doesn't implement to-str because it's not a sequence. What we really want is to apply that function to what the grammar matches. One way to look at grammars is that they 'contain' a result of a successful match (potentially), sort of like how Maybe values may contain another value or not. And the Container protocol has a nifty function called map for just this occasion which is implemented for grammars. Sooo ...

(def word (map (grmr/one-or-more (grmr/char-range "a" "z"))
               to-str))

The call to map the grmr/one-or-more expression with the function to-str creates a new grammar. Hopefully, you can see what the result does. (Another simulated REPL command)

((rd/parser word) "not")
=> <maybe "not">

Now, we need a grammar that matches one or more words in the sentence. I'm just gonna drop the whole thing in here and explain the two new pieces.

(def words (map (grmr/one-or-more (grmr/all word space))
                flatten))

First of all, grmr/all create a new grammar from a number of other grammars, in this case word and space. On second thought, this deserves it's own REPL simulation

((rd/parser (grmr/all word space)) "not ")
=> <maybe ("not")>

So what the grmr/all combinator does is create a new grammar that on a successful match of all of the sub-grammars, returns a list of the result of each of the successful matches. In this case, the space grammar ignores its match, so it doesn't show up even though there's a space in the original string. So if we match one or more of this grammar, we get a list of lists. (Final simulated REPL.)

((rd/parser (grmr/one-or-more (grmr/all word space))) "not a bot")
=> <maybe [("not"), ("a"), ("bot")]>

Which is not what we want. So we create a new grammar by mapping the flatten function over the result to produce the words grammar.

Here's the final program in all its copy-and-past glory. Hopefully, it's an easy read now.

(add-ns rd (git-dependency "https://github.com/Toccata-Lang/recursive-descent.git"
                           "recursive-descent.toc"
                           :sha "882b014"))
(add-ns grmr (git-dependency "https://github.com/Toccata-Lang/grammar.git"
                             "grammar.toc"
                             :sha "7690cd3"))

(def space (grmr/ignore (grmr/optional " ")))

(def word (map (grmr/one-or-more (grmr/char-range "a" "z"))
               to-str))

(def words (map (grmr/one-or-more (grmr/all word space))
                flatten))

(def parser (rd/parser words))

(main [_]
      (println "=>" (parser "i am not a bot"))
      (println "=>" (parser "i am NOT a bot")))

Put that in a file and run it.

(Now I'm taking a nap.)

Dec 26, 2017 - Parsing Beginnings

Comments

Changing Direction

I hope you all had a very Merry Christmas.

For a long time, I've been writing posts about Toccata. I wanted a body of material that explained the philosophy and capabilities of Toccata. But now we change direction and instead of talking about Toccata, we're going to start using Toccata. Starting with the basics.

When I was young ...

and just learning to program, I remember reading an article/program in a computer magazine explaining something called 'parsing'. I didn't even know what 'parsing' was, but it was an interesting program to play around with. There's been a massive amount of research about parsing in the years since and many advances. Parsing a sequence of bytes into structured data is still one of the fundamental programming tasks today. And yet, hand rolling a parser with imperative code is hard to write, debug and verify. Many libraries exist in almost every language to help programmers and Toccata has it's own. Here's a short program to parse a hard-code text string.

(add-ns rd (git-dependency "https://github.com/Toccata-Lang/recursive-descent.git"
                           "recursive-descent.toc"
                           :sha "882b014"))

(def parser (rd/parser "one"))

(main [_]
      (println "1:" (parser "one"))
      (println "2:" (parser "two")))

This is a significant chunk of code, but we'll be building on it, so I'm going to explain it all line-by-line. You can copy-and-past the above to a file and use the run script to run it.

(add-ns rd (git-dependency "https://github.com/Toccata-Lang/recursive-descent.git"
                           "recursive-descent.toc"
                           :sha "882b014"))

This expression is how to import a Git repo as a dependency. There is no central package repository. Instead any Git repo can be a dependency. In this case, the file recursive-descent.toc is the file imported from the recursive descent repo. This expression also states that any symbol imported from recursive-descent.toc must be prefixed by rd/ when referenced in this file.

(def parser (rd/parser "one"))

This is how a recursive descent parser is created from a grammar, in this case, the simplest possible one. Couldn't be easier. This dependency has the definitions to produce a recursive descent parser from a grammar. I intend to provide other kinds of parsers eventually.

(main [_]
      (println "1:" (parser "one"))
      (println "2:" (parser "two")))

And here we have 2 examples of using that parser to parse 2 different strings. The output generated from this short little program is

1: <maybe one>
2: <nothing>

If the parser successfully matches the string passed to it, the default is to just return that string wrapped in a Maybe value. But if it doesn't match the string, nothing is returned. That's enough to get us started.

(add-ns rd (git-dependency "https://github.com/Toccata-Lang/recursive-descent.git"
                           "recursive-descent.toc"
                           :sha "882b014"))
(add-ns grmr (git-dependency "https://github.com/Toccata-Lang/grammar.git"
                             "grammar.toc"
                             :sha "7690cd3"))

(def grammar (grmr/any "one"
                       "two"
                       "three"))

(def parser (rd/parser grammar))

(main [_]
      (println "1:" (parser "one"))
      (println "2:" (parser "two"))
      (println "3:" (parser "three"))
      (println "4:" (parser "four")))

Literal strings are just one of several 'simplest' grammars possible. The grammar.toc file has more as well as some functions for composing simpler grammars into more complex ones.

(def grammar (grmr/any "one"
                       "two"
                       "three"))

And here we have a simple grammar. In this case, 3 terminal grammars are composed into a single grammar with the grmr/any function. This composed grammar will match any of the three strings, but nothing else. We'll see this in action in a second.

(def parser (rd/parser grammar))

And we create a parser from the given grammar

(main [_]
      (println "1:" (parser "one"))
      (println "2:" (parser "two"))
      (println "3:" (parser "three"))
      (println "4:" (parser "four")))

And here we have 4 examples of using that parser to parse 4 different strings. The output generated from this short little program is

1: <maybe one>
2: <maybe two>
3: <maybe three>
4: <nothing>

No big deal.