Cat’s hacks:

a parser-combinator approach to parsing JSON

How I learned how to create parser combinators

Powerful, succinct code is built by combining small pieces into larger ones: functions out of functions, macros out of functions, macros out of macros... Here I describe how I learned how to use a powerful combinator technique: creating a parser with parser combinators.

Parser combinators

A parser combinator library lets you build larger parsers out of smaller ones. For example, a simple alphanum parser could recognize a single letter or digit character; a many1 parser combinator could take an existing parser and return a parser that recognized one or more instances of what the existing parser recognized, and then (many1 alphanum) would match one or more letters or digits.

My first attempt at a parser combinator library

When I needed to read and write JSON from Arc last year, I wrote a JSON reader/writer library. I had two goals: one, I simply needed to get a working library; two, I was curious to see if I could learn how to write a parser combinator library, so that the JSON library code would be simple, clear, and straightforward.

I can often create a library by first writing some working code, and then factoring common functionality out of the code to create the library. This is helpful because it means the library does what I need it to do to support my code, and at each step I have a complete working system that I can test to make sure I’m not making any mistakes. But with the first version of my JSON library, I ended up with code like

(def match-json-string (j)
    (if (is car.j #\")
         (let (j a)
              ((afn (j a)
                 (if no.j (err "missing close quote"))
                 (if (is car.j #\")
                      (list cdr.j a)
                      (iflet (j c) (match-json-backslash j)
                        (self j (cons c a))
                        (self (cdr j) (cons (car j) a)))))
               cdr.j nil)
           (list j (coerce rev.a 'string)))))

which, no matter how much I stared at it, I wasn’t able to figure out how to factor anything useful out.

I got some help on the Arc forum, and rather than passing the parse position around explicitly, we instead used an input stream to keep track of the parse position for us:

(def match-json-string ()
    (match-char #\"
      (liststr:accum a
        (while (~match-char #\")
          (atend-err "missing close quote")
          (a (or (match-json-backslash)
                 (readc (stdin))))))))

This version is much better: we can now easily see that a JSON string starts and ends with a quote character, and contains a sequence of regular characters and backslash escape sequences.

I could have continued and written the rest of the JSON parsing library in this style. However, I remained intrigued by the idea of parser combinators, so I thought I’d try again, this time from the very beginning.

My next try

In a parser combinator library, all the parser functions need to be able to be called in the same way so that they can be plugged in to the parser combinators, and the parsers returned by the parser combinators also need to use that same interface, so that they in turn can be plugged in to more parser combinators to build up a complete parser. Thus the choice of what interface to use is a key decision for a parser combinator library.

For my JSON library, I needed to be able to compose return values, as JSON values can recursively contain other JSON values. So I chose an interface where a parser is passed a parse position, and if it matches the input, returns a new parse position and a return value, and returns nil if it doesn't match the input.

This is hardly the only interface that could be chosen. The interface could use a continuation passing style, where the parser is passed a success function and a failure function to call with the parse result. A library could drop the return value from the interface if return values didn’t need to be composed for that application. Other applications might want a parser combinator library that had something else in the interface that supported some other need they had.

The parse position can also be represented in different ways. One choice is an integer position within the input string. For convenience I convert the input string to a list of characters, and then use a cons cell within the list can be used itself for the parse position. So that I can easily see when I’m returning a result, I’ll make return a synonym for list:

(def return (new-parse-position return-value)
  (list new-parse-position return-value))

Here’s a parser that matches the letter A, and has "found A!" as its return value if it succeeds:

arc> (def match-a (p)
       (if (is car.p #\a)
            (return cdr.p "found A!")))
#<procedure: match-a>
arc> (match-a '(#\a #\b #\c))
((#\b #\c) "found A!")
arc> (match-a '(#\d #\e #\f))
nil

The higher order parser combinator functions just pass the parser position around without looking at it, so they can be used whether p is a cons cell, an integer position in the input string, or something else.

Displaying the parse result

Next is a convenience function to convert a string into a list of characters, run a parser against it, and display the result:

arc> (def show-parse (parser str)
       (let p (coerce str 'cons)
         (iflet (p2 r) (parser p)
           (do (pr "match: ")
               (write r)
               (prn " remaining: " (coerce p2 'string)))
           (prn "no match")))
       nil)
#<procedure: show-parse>
arc> (show-parse match-a "abc")
match: "found A!" remaining: bc
nil
arc> (show-parse match-a "def")
no match
nil

Matching JSON literals


Matching a JSON “true” and returning t on a successful match is easy:

arc> (def json-true (p)
         (if (begins p (coerce "true" 'cons))
              (return (nthcdr 4 p) t)))
#<procedure: json-true>
arc> (show-parse json-true "true,1,2,3")
match: t remaining: ,1,2,3
nil
arc> (show-parse json-true "1,2,3")
no match
nil

To match any literal string, I can factor out the string that’s being matched and the return value:

(def match-literal (pat val)
  (with (patlist (coerce pat 'cons)
         patlen  len.pat)
    (fn (p)
      (if (begins p patlist)
           (return (nthcdr patlen p) val)))))

and make parsers for the JSON literals:

(= json-true  (match-literal "true"  t))
(= json-false (match-literal "false" nil))
(= json-null  (match-literal "null"  nil))
arc> (show-parse json-false "false")
match: nil remaining: 
nil

alt

I come to my first parser combinator. I want to be able to try a series of alternatives, until I find one that matches:

(= json-value
  (alt json-true
       json-false
       json-null))

I need to go through a list of parsers, and call each one on the parse position until I get one that doesn’t return nil, and return what it returns. Arc’s some function makes this easy:

(def alt parsers
  (fn (p)
    (some [_ p] parsers)))

Now I have a json-value parser that will match any of the JSON literals:

arc> (show-parse json-value "true,1,2")
match: t remaining: ,1,2
nil
arc> (show-parse json-value "false,1,2")
match: nil remaining: ,1,2
nil

JSON numbers

Like the match-literal parser that manually looks forward in the input stream to see if it matches a literal string, I could write a parser that manually matches JSON number characters in the input stream. What I’d like to do is make a many1 parser combinator, which matches one or more instances of something:

(= json-number
  (many1 json-number-char))

where json-number-char is a parser that matches one JSON number character:

(= json-number-char
  (match [find _ ".-+eE1234567890"]))

and match takes a predicate function. It succeeds returning the next item in the input if the function returns true when called with that item:

(def match (f)
  (fn (p)
    (and p
         (let x car.p
           (if (f x)
             (return cdr.p x))))))

many1

So what about many1? If I had a parser combinator many which matched zero or more things, and a combinator seq that matched several parsers in sequence (first thing A, then thing B, etc.), then I could write many1 as:

(def many1 (parser)
  (seq parser
       (many parser)))

which says that one way of matching one or more of something is to match one of them, followed by zero or more of that something.

Wow! Great! Progress! Now if only I had seq and many...

seq

The seq combinator takes a list of parsers, and applies them one after another to the input. The whole sequence only succeeds if all of the parsers succeed. And, for the return value, I want to get a list of what each of the parsers returned.

OK, so I’m going to need to loop through the parsers. The parse position is going to change each time, as each parser in turn successfully matches some input. And I’ll need a variable for the accumulated result, which starts off as nil:

(def seq parsers
  (fn (p)
    ((afn (p parsers a)
       ...)
     p parsers nil)))

If I’ve gotten through all the parsers, the parsers list will now be empty, which means all the parsers matched successfully, and so I can return the result. I’ll be cons’ing up the result as I go through the loop, so I’ll use rev to get “123” instead of “321”:

(def seq parsers
  (fn (p)
    ((afn (p parsers a)
       (if parsers
            'do-something-with-the-next-parser
            (return p rev.a)))
     p parsers nil)))

Hey, I’m making progress! I can already call seq on an empty parser list, and have it match no things and successfully return an empty list:

arc> (show-parse (seq) "123")
match: nil remaining: 123
nil

Impressive, yes? :-)

If there are some more parsers left, I need to call the next one, getting the new parse position and its return value. I’ll use iflet, so if the parser fails to match and returns nil, I’ll fall out of the whole loop returning nil from the parser created by seq, so that if any of the parsers fail then the entire match fails:

(iflet (p2 r) (car.parsers p)
  ...)

If the match succeeds, then I need to loop again with the new parse position, the next parser on the list, and add the return value to the accumulator:

(iflet (p2 r) (car.parsers p)
  (self p2 cdr.parsers (cons r a)))

Putting it all together, I get:

(def seq parsers
  (fn (p)
    ((afn (p parsers a)
       (if parsers
            (iflet (p2 r) (car.parsers p)
              (self p2 cdr.parsers (cons r a)))
            (return p rev.a)))
     p parsers nil)))

Let’s try it out:

arc> (show-parse (seq json-number-char
                      json-number-char
                      json-number-char)
                 "123")
match: (#\1 #\2 #\3) remaining: 
nil

Nifty. So, anyone see a way to make the seq function shorter?

many

I’m not sure how to implement many, so I’ll start with something simpler. optional matches zero or one things:

(def optional (parser)
  (fn (p)
    (iflet (p2 r) (parser p)
      ...)))

If the match succeeds, then I’ll want to return what the parser returned. Since I’m working towards many, and I’ll want many to return a list of the return values for each time the parser matches, I’ll have optional return a list of the one return value:

(def optional (parser)
  (fn (p)
    (iflet (p2 r) (parser p)
      (return p2 (list r))
      ...)))

If the parser doesn’t match, then the optional still succeeds, since it matches zero or one things, so I’ll have it return the empty list:

(def optional (parser)
  (fn (p)
    (iflet (p2 r) (parser p)
      (return p2 (list r))
      (return p nil))))
arc> (show-parse (optional json-number-char) "123")
match: (#\1) remaining: 23
nil
arc> (show-parse (optional json-number-char) "a123")
match: nil remaining: a123
nil

So many is like optional, except that it applies the parser to the input over and over again until finally the match fails:

arc> (def many (parser)
       (fn (p)
         ((afn (p a)
            (iflet (s2 r) (parser p)
              (self s2 (cons r a))
              (return p rev.a)))
          p nil)))
#<procedure: many>
arc> (show-parse (many json-number-char) "abc")
match: nil remaining: abc
nil
arc> (show-parse (many json-number-char) "1abc")
match: (#\1) remaining: abc
nil
arc> (show-parse (many json-number-char) "12abc")
match: (#\1 #\2) remaining: abc
nil
arc> (show-parse (many json-number-char) "123abc")
match: (#\1 #\2 #\3) remaining: abc
nil

many1, implemented

Now I get to try out my implementation of many1:

arc> (def many1 (parser)
       (seq parser
            (many parser)))
#<procedure: many1>
arc> (show-parse (many1 json-number-char) "123abc")
match: (#\1 (#\2 #\3)) remaining: abc
nil

The matching part is working perfectly, but I’m getting (r (r ...)) for my return value instead of the (r ...) that I want. That’s because seq is returning a list of the return values of its parsers: the first match and the many match. I’ll need to modify the return value to cons the two parts together:

(def many1 (parser)
  (fn (p)
    (iflet (p2 (r1 rs)) ((seq parser
                              (many parser))
                         p)
      (return p2 (cons r1 rs)))))

Now I get the return value I’m looking for:

arc> (show-parse (many1 json-number-char) "123abc")
match: (#\1 #\2 #\3) remaining: abc
nil

JSON numbers, implemented

This was my definition for parsing a JSON number:

(= json-number
  (many1 json-number-char))
arc> (show-parse json-number "123abc")
match: (#\1 #\2 #\3) remaining: abc
nil

But for the return value I want to get an Arc number, not just the list of parsed characters. Like with many1, I want to modify the value returned by a parser, and so it’s time to make a function to do that:

(def on-result (f parser)
  (fn (p)
    (iflet (p2 r) (parser p)
      (return p2 (f r)))))

Now many1 can use that:

(def many1 (parser)
  (on-result (fn ((r1 rs))
               (cons r1 rs))
             (seq parser
                  (many parser))))

Which I can make a bit shorter with a macro:

(mac with-result (vars parser . body)
  `(on-result (fn (,vars) ,@body)
              ,parser))
(def many1 (parser)
  (with-result (r1 rs) (seq parser
                            (many parser))
    (cons r1 rs)))

For seq specifically, perhaps I’ll often be working with the return value of each of the parsers in the sequence, so I might try a macro for that and see how often it’s used:

(mac with-seq (vars-parsers . body)
  (withs (ps (pair vars-parsers)
          vars (map car ps)
          parsers (map cadr ps))
    `(on-result (fn (,vars) ,@body) (seq ,@parsers))))
(def many1 (parser)
  (with-seq (r1 parser
             rs (many parser))
    (cons r1 rs)))

And finally the implementation of json-number:

(= json-number
  (with-result cs (many1 json-number-char)
    (coerce (coerce cs 'string) 'num)))
arc> (show-parse json-number "123abc")
match: 123 remaining: abc
nil

JSON strings


I’ll need to be able to parse the four hexadecimal digits after a Unicode escape sequence \u and turn it into a character:

(def hexdigit (c)
  (and (isa c 'char)
       (or (<= #\a c #\f) (<= #\A c #\F) (<= #\0 c #\9))))

(= fourhex
  (must "four hex digits required after \\u"  
    (with-seq (h1 (match hexdigit)
               h2 (match hexdigit)
               h3 (match hexdigit)
               h4 (match hexdigit))
      (coerce (int (coerce (list h1 h2 h3 h4) 'string) 16) 'char))))

Yup, with-seq turned out to be useful. The must function turns a failed match into an error:

(def must (errmsg parser)
  (fn (p)
    (or (parser p)
        (err errmsg))))

Let’s see, I’ll need to parse the other JSON backslash escape sequences:

(def json-backslash-char (c)
  (case c
    #\" #\"
    #\\ #\\
    #\/ #\/
    #\b #\backspace
    #\f #\page
    #\n #\newline
    #\r #\return
    #\t #\tab
    (err "invalid backslash char" c)))

A JSON string backslash escape sequence is one or the other:

(= json-backslash-escape
  (seq (match [is _ #\\])
       (alt (seq (match [is _ #\u])
                        fourhex)
            (fn (p)
              (return cdr.p (json-backslash-char car.p))))))

but oops, seq is giving me lists when all I want is just the character:

arc> (show-parse json-backslash-escape "\\u0041")
match: (#\\ (#\u #\A)) remaining: 
nil

In both cases I want just the return value of the second parser in the sequence, so I’ll make a combinator to do that:

(def seq2 parsers
  (with-result results (apply seq parsers)
    (results 1)))

And, I can extract a match-is:

(def match-is (x)
  (match [is x _]))

Now I have:

(= json-backslash-escape
  (seq2 (match-is #\\)
        (alt (seq2 (match-is #\u)
                   fourhex)
             (fn (p)
               (return cdr.p (json-backslash-char car.p))))))

That’s better:

arc> (show-parse json-backslash-escape "\\u0041")
match: #\A remaining: 
nil
arc> (show-parse json-backslash-escape "\\/")
match: #\/ remaining: 
nil
arc> (show-parse json-backslash-escape "\\\"")
match: #\" remaining: 
nil

Other characters in the string can be anything that isn’t a closing quote:

(match [isnt _ #\"])

Now I have an implementation for json-string:

(= json-string
  (with-result cs (seq2 (match-is #\")
                        (many (alt json-backslash-escape
                                   (match [isnt _ #\"])))
                        (match-is #\"))
    (coerce cs 'string)))
arc> (show-parse json-string "\"\\u0041b\\\\c\"")
match: "Ab\\c" remaining: 
nil

JSON arrays


How to match that?

After the first value, any comma has to be followed by another value. I don’t care about the return value of matching the comma, so I’ll use seq2 to get just the return value of the JSON value:

(seq2 (match-is #\,)
      json-value)

There can be many (zero or more) of those “comma followed by a value” pairs:

(many (seq2 (match-is #\,)
            json-value))

There does have to be a value before the first comma:

(seq json-value
     (many (seq2 (match-is #\,)
                 json-value)))

And the whole thing is optional, because the JSON array might be empty:

(optional (seq json-value
               (many (seq2 (match-is #\,)
                           json-value))))

So this is pretty close for matching a JSON array:

(= json-array
  (seq2 (match-is #\[)
        (optional (seq json-value
                       (many (seq2 (match-is #\,)
                                   json-value))))
        (match-is #\])))

Just a couple of problems. JSON arrays can contain JSON values, which can recursively contain JSON arrays...

(= json-value
  (alt json-string
       json-number
       json-array
       json-true
       json-false
       json-null))

But when I’m defining json-array, I haven’t defined json-value yet...

arc> (= json-array
       (seq2 (match-is #\[)
             (optional (seq json-value
                            (many (seq2 (match-is #\,)
                                        json-value))))
             (match-is #\])))

reference to undefined identifier: _json-value

Putting json-value first doesn’t help of course, since then it will be json-array that isn’t defined yet. So, I’ll need to wrap the reference to json-value in a function:

(= json-array
  (seq2 (match-is #\[)
        (optional (seq (fn (p) (json-value p))
                       (many (seq2 (match-is #\,)
                                   (fn (p) (json-value p))))))
        (match-is #\])))

Which I can make shorter with a macro:

(mac forward (parser)
  (w/uniq p
    `(fn (,p) (,parser ,p))))

Now I can easily have forward references:

(= json-array
  (seq2 (match-is #\[)
        (optional (seq forward.json-value
                       (many (seq2 (match-is #\,)
                                   forward.json-value))))
        (match-is #\])))

Next I need to fix the return value:

arc> (show-parse json-value "[1,2,3]")
match: ((1 (2 3))) remaining: 
nil

Back when I wrote optional, if the parser matched, I put its return value in a list. Now that I’m actually using optional for the first time, it turns out I don’t want that, I want just the value. An easy fix:

(def optional (parser)
  (fn (p)
    (iflet (p2 r) (parser p)
      (return p2 r)
      (return p nil))))

But now optional is just returning what the parser returns, so I could write it as:

(def optional (parser)
  (alt parser
       (fn (p)
         (return p nil))))

Now I get:

arc> (show-parse json-value "[1,2,3]")
match: (1 (2 3)) remaining: 
nil

This is the same pattern I had before with many1: a sequence of A followed by B, and I want to cons the single item returned by A together with the list of items returned by B. I can extract a parse-cons function for that:

(def cons-seq (a b)
  (with-seq (r  a
             rs b)
    (cons r rs)))

Now many1 is:

(def many1 (parser)
  (cons-seq parser
            (many parser)))

And I get the right return value from a JSON array:

(= json-array
  (seq2 (match-is #\[)
        (optional (cons-seq forward.json-value
                            (many (seq2 (match-is #\,)
                                        forward.json-value))))
        (match-is #\])))
arc> (show-parse json-value "[1,2,3]")
match: (1 2 3) remaining: 
nil

JSON objects


Both JSON arrays and objects have a list of things interspersed by commas; in the case of json-object, it’s key-value pairs. In json-array, it was JSON values that were separated by commas:

(optional (cons-seq forward.json-value
                    (many (seq2 (match-is #\,)
                                forward.json-value))))

I can extract that pattern:

(def parse-intersperse (separator parser)
  (optional (cons-seq parser
                      (many (seq2 separator
                                  parser)))))

A comma separated list is a specific case:

(def comma-separated (parser)
  (parse-intersperse (match-is #\,) parser))

Now json-array is:

(= json-array
  (seq2 (match-is #\[)
        (comma-separated forward.json-value)
        (match-is #\])))

In a json-object, key-value pairs are separated by a colon, and the key is always a JSON string.

(= json-object-kv
  (with-seq (key   json-string
             colon (match-is #\:)
             value forward.json-value)
    (list key value)))

This matches a single key-value pair, and returns them as a two element list:

arc> (show-parse json-object-kv "\"abc\":[1,2,3]")
match: ("abc" (1 2 3)) remaining: 
nil

Arc’s listtab will convert a list of those key-value pairs into a table for me:

(= json-object
   (on-result listtab
     (seq2 (match-is #\{)
           (comma-separated json-object-kv)
           (match-is #\}))))

And now I can add json-object to json-value:

(= json-value
  (alt json-string
       json-number
       json-object
       json-array
       json-true
       json-false
       json-null))
arc> (show-parse json-value "{\"a\":[1,2,{\"b\":3}]}")
match: #hash(("a" . (1 2 #hash(("b" . 3)) . nil))) remaining: 
nil

fromjson

fromjson is the entry point into the JSON parser library. It takes a string, parses it as a JSON value, and converts it to an Arc value:

(def fromjson (s)
  (iflet (p r) (json-value (coerce s 'cons))
    (do (if p (err "Unexpected characters after JSON value" (coerce p 'string)))
        r)
    (err "not a JSON value" s)))
arc> (fromjson "{\"a\":[1,2,{\"b\":3}]}")
#hash(("a" . (1 2 #hash(("b" . 3)) . nil)))

a work in progress

Here’s the complete JSON parser library as I’ve gotten it to so far. The parser combinator library code can be found at parsecomb0.arc.

There’s a few things left to finish: the JSON specification says that whitespace can be inserted between any pair of tokens, and I could add more must clauses so that, for example, if there’s a missing close quote in a string, we get a more specific error message than that the whole input is “not a JSON value”.

Although it’s not really a valid comparison until the library is done, it’s still fun to compare this implementation side-by-side with one of the many JSON libraries listed at the bottom of http://json.org/.

(= json-true  (match-literal "true"  t))
(= json-false (match-literal "false" nil))
(= json-null  (match-literal "null"  nil))

(= json-number-char
  (match [find _ ".-+eE1234567890"]))

(= json-number
  (with-result cs (many1 json-number-char)
    (coerce (coerce cs 'string) 'num)))

(def hexdigit (c)
  (and (isa c 'char)
       (or (<= #\a c #\f) (<= #\A c #\F) (<= #\0 c #\9))))

(= fourhex
  (must "four hex digits required after \\u"  
    (with-seq (h1 (match hexdigit)
               h2 (match hexdigit)
               h3 (match hexdigit)
               h4 (match hexdigit))
      (coerce (int (coerce (list h1 h2 h3 h4) 'string) 16) 'char))))

(def json-backslash-char (c)
  (case c
    #\" #\"
    #\\ #\\
    #\/ #\/
    #\b #\backspace
    #\f #\page
    #\n #\newline
    #\r #\return
    #\t #\tab
    (err "invalid backslash char" c)))

(= json-backslash-escape
  (seq2 (match-is #\\)
        (alt (seq2 (match-is #\u)
                   fourhex)
             (fn (p)
               (return cdr.p (json-backslash-char car.p))))))

(= json-string
  (with-result cs (seq2 (match-is #\")
                        (many (alt json-backslash-escape
                                   (match [isnt _ #\"])))
                        (match-is #\"))
    (coerce cs 'string)))

(= json-array
  (seq2 (match-is #\[)
        (comma-separated forward.json-value)
        (match-is #\])))

(= json-object-kv
  (with-seq (key   json-string
             colon (match-is #\:)
             value forward.json-value)
    (list key value)))

(= json-object
   (on-result listtab
     (seq2 (match-is #\{)
           (comma-separated json-object-kv)
           (match-is #\}))))

(= json-value
  (alt json-string
       json-number
       json-object
       json-array
       json-true
       json-false
       json-null))

(def fromjson (s)
  (iflet (p r) (json-value (coerce s 'cons))
    (do (if p (err "Unexpected characters after JSON value" (coerce p 'string)))
        r)
    (err "not a JSON value" s)))

comment

Comment in the Arc Forum.