We're rolling now

I don't know about you, but I'm really pleased with how this blog series is developing. Each post seems to build on the last by adding a couple of new pieces. I would love to say I intended it that way, but I can't.

In this post, we're going to begin looking at file I/O in addition to parsing.

More than command lines

So far, we've seen how to parse the command line arguments passed to our main function. This could be extended to create a full-blown argument-parsing library with long and short options, etc. But at some point, we're going to need more than what can be conveniently passed in on the command line. We're going to want to keep a file full of configuration information that is read at each invocation, with its contents used to inform our program's execution.

This turns out to be just a tiny step beyond what we've already seen. I'm of the opinion that config files should be relatively simple. If you need to use JSON or XML for nesting complex data structures, you're getting close to needing a full-blown DSL, and what you're writing is more of an interpreter than just a program with a config file. So for our purposes, we're going to limit the lines of a config file to

  • empty lines (containing only spaces and tabs)
  • comment lines beginning with '#'
  • value lines that have a name (alphabetic characters and '-') and a value (integers or strings delimited by '"')

and that's it. So let's do this.
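
To make those three kinds of lines concrete before getting to the grammar, here is a rough sketch in Python (not Toccata) that classifies a single line with regular expressions. The name classify_line and the three patterns are my own illustration and are not part of the grammar built in this post; they only mirror the rules listed above, under the assumption that leading spaces and tabs are allowed on every kind of line.

```python
import re

# Hypothetical patterns mirroring the three kinds of lines listed above.
EMPTY = re.compile(r"^[ \t]*$")
COMMENT = re.compile(r"^[ \t]*#.*$")
# A name (letters and '-'), some whitespace, then an integer or a "quoted string".
VALUE = re.compile(r'^[ \t]*([A-Za-z-]+)[ \t]+(?:(\d+)|"([^"]*)")[ \t]*$')


def classify_line(line):
    """Return ("empty",), ("comment",), ("value", name, value) or ("error",)."""
    if EMPTY.match(line):
        return ("empty",)
    if COMMENT.match(line):
        return ("comment",)
    m = VALUE.match(line)
    if m:
        name, digits, text = m.groups()
        # Exactly one of digits/text is set, depending on which alternative matched.
        value = int(digits) if digits is not None else text
        return ("value", name, value)
    return ("error",)
```

Anything that fits none of the three patterns comes back as an error, which is the "and that's it" part of the rule.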

The code

First, import the needed libraries

(add-ns rd (git-dependency "https://github.com/Toccata-Lang/recursive-descent.git"
                           "recursive-descent.toc"
                           :sha "882b014"))
(add-ns grmr (git-dependency "https://github.com/Toccata-Lang/grammar.git"
                             "grammar.toc"
                             :sha "7690cd3"))
(add-ns fio (git-dependency "https://github.com/Toccata-Lang/file-io.git"
                            "file-io.toc"
                            :sha "e7a489b"))

We've added the file I/O library to our list of imports. This lib contains some functions to do basic reads and writes of files.

Now let's start describing our grammar for the config file lines. Here's an empty line.

(def whitespace (grmr/one-or-more (grmr/any " "
                                            "\t")))

(def newline "\n")

(def empty-line (grmr/all (grmr/optional whitespace)
                          newline))

We start by declaring rules for whitespace and the newline character. Then we define the empty-line rule: optional whitespace followed by a newline.

Now, we declare a comment

(def not-newline (grmr/not-char "\n"))

(def comment (grmr/all (grmr/optional whitespace)
                       "#"
                       (grmr/none-or-more not-newline)
                       newline))

The grmr/not-char combinator does what it says: it matches any character other than the one given. Now, let's bring over name and integer from the last post.

(def name (map (grmr/one-or-more (grmr/any grmr/alpha
                                           "-"))
               to-str))

(def integer-value (map (grmr/one-or-more grmr/digit)
                        (fn [digits]
                          (str-to-int (to-str digits)))))

A new requirement is reading in strings delimited by double quotes. But we have all the tools to do this easily.

(def string-value (grmr/apply-fn to-str
                                 (grmr/ignore "\"")
                                 (grmr/none-or-more (grmr/not-char "\""))
                                 (grmr/ignore "\"")))

And now, the rubber meets the road. We need to combine these pieces to parse a config file line and create a data structure of the name string and config value. A moment's thought surfaces some requirements.

  • We don't know which names will be defined in the file
  • We don't know what order they will be defined in
  • We want to end up with a HashMap of all names to values in the file

There are multiple ways to do this. My preferred way is to put each name/value pair into a HashMap and then just compose all the maps after parsing all the lines.

(def config-line (grmr/apply-fn (fn [param val]
                                  {param val})
                                (grmr/ignore (grmr/optional whitespace))
                                name
                                (grmr/ignore whitespace)
                                (grmr/any integer-value
                                          string-value)
                                (grmr/ignore (grmr/optional whitespace))
                                (grmr/ignore newline)))

(side note: I left that anonymous function in to show the param and value. But replacing it with hash-map works just fine.)

The final piece is to pull it all together to declare a grammar that will parse an entire config file

(defn ignore [g]
  (map g (fn [_] {})))

(def config-file (map (grmr/none-or-more (grmr/any (ignore comment)
                                                   config-line
                                                   (ignore empty-line)))
                      (fn [config-lines]
                        (comp* {} config-lines))))

(def parser (rd/parser config-file))

Each line gets converted to a HashMap. But we need to ignore the comment and empty lines. So we define a 'higher order' grammar rule. (Really, it's just a function that takes a grammar rule and returns a modified version.) In this case, ignore takes a grammar rule and returns a rule that always produces an empty HashMap upon a successful match. This is different from, and does not replace, grmr/ignore from the grammar library.
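
If the idea of a function that takes a rule and returns a modified rule is unfamiliar, the following Python sketch may help. It is not how the Toccata grammar library represents rules; purely as an assumption for illustration, a rule here is a plain function from (text, position) to either (value, new position) or None on failure. The helper literal is hypothetical and exists only to give ignore something to wrap.

```python
def literal(s):
    """Hypothetical rule: match the exact string s at the current position."""
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return parse


def ignore(rule):
    """Wrap a rule so that a successful match always yields an empty dict."""
    def parse(text, pos):
        result = rule(text, pos)
        if result is None:
            return None          # failure passes through unchanged
        _value, new_pos = result
        return {}, new_pos       # discard the matched value, keep the position
    return parse
```

The wrapped rule consumes exactly the same input as the original; only the value it hands back is different.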

And then, it's straightforward to define our config file grammar. It's just none (since the file may be empty) or more lines. Each line is either a comment or an empty line, which produces an empty HashMap, or a configuration name/value pair, which produces a HashMap with a single entry.

This list of HashMaps is passed to the anonymous function as the parameter config-lines. Then we use the parametric form of comp, the comp* function, to squash all the HashMaps into a single HashMap. This also means that if a config name is defined multiple times in a file, only the last one will appear in the final HashMap.
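
The "last one wins" behavior is easy to see in a small Python sketch. This is only an analogy: merge_maps is a hypothetical stand-in for composing the maps with comp*, written as a left-to-right fold over Python dicts starting from an empty one.

```python
from functools import reduce


def merge_maps(maps):
    """Fold a list of dicts into one; later entries override earlier ones."""
    return reduce(lambda acc, m: {**acc, **m}, maps, {})


# One dict per parsed line: comments and empty lines contribute {}.
parsed_lines = [
    {},
    {"thread-count": 4},
    {},
    {"url": "http://another.com"},
    {"thread-count": 100},
]

merged = merge_maps(parsed_lines)
```

Here merged ends up with two entries, and thread-count holds 100 because that definition came later in the list.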

All that's left is to read in the config file and parse it.

(main [_]
      (for [config-map (parser (fio/slurp "config.txt"))]
        (map (seq config-map) (fn [[name value]]
                                (println (str name ":") value)))))

I'm going to leave the explanation of the main function out. You should be able to read this if you've read the previous posts. I will say that fio/slurp just pulls the entire contents of a file into a string in one shot.

So, if you have a config file that looks like this

# this is a comment
some-config    19
thread-count     100
str-val         "some string"
url             "http://another.com"

#              another comment

then when you run this program, you should get

url: http://another.com
str-val: some string
some-config: 19
thread-count: 100

Though the order of the lines may be different.

Better than a library

In other languages, there might be a library or package that you would import to handle config files. Open source is great, but when you pull in a package like that, you take on the responsibility for keeping it integrated with your application and that comes at a cost. What I've tried to show here is how easy it is to have custom software to do exactly what you need.

For example, it would be trivial to allow only certain names in the config file, or to add different kinds of values. I think this is much superior to importing a bunch of libraries to do relatively simple tasks.
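
As one hedged illustration of restricting names, here is a Python sketch that checks a parsed config against an allow-list after parsing. The names ALLOWED_NAMES and check_names are hypothetical, and doing the check after parsing is just one option; the same restriction could instead be expressed directly in the grammar by matching only specific names.

```python
# Hypothetical allow-list, using the names from the sample config file.
ALLOWED_NAMES = {"some-config", "thread-count", "str-val", "url"}


def check_names(config, allowed=ALLOWED_NAMES):
    """Return config unchanged, or raise if it contains a name not in allowed."""
    unknown = sorted(set(config) - set(allowed))
    if unknown:
        raise ValueError("unknown config names: " + ", ".join(unknown))
    return config
```

A misspelled name such as thread-cuont would then be reported at startup instead of being silently ignored.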

But ...

What if you really do want to parse JSON?

That's up next.