Angstrom parser optimization

I am new to parsing and still not sure I have all the terminology right. After quite some struggle I managed to write a parser that works, but it is suboptimal.

Here is a simplified version of my problem. I need to parse lines of a file that look like this:

    SomeKey  SomeValue  ;SomeComment
    SomeKey  SomeValue  
    ;SomeComment

Some of these parts are optional, and blank lines are also allowed, so here is where I am at the moment:

open Angstrom
(*  skipping definitions of sub-parsers *)

let data_line =
  choice [
    list [comment];
    list [whitespace; comment;];
    list [whitespace; key; whitespace; value; comment];
    list [whitespace; key; whitespace; value;];
    list [whitespace; key; comment];
    list [whitespace; key;];
    list [whitespace];
  ]

This works correctly on all my sample files and passes my tests (which cover what I think are all the possible combinations), but I think it’s a lot slower than it should be. To identify a [whitespace; key] line (the most common kind), the parser has to go over the whole line at least 3 times.

I tried playing with the Angstrom.Buffered interface and that’s probably what I need, but I wasn’t able to figure out how to make it compile. I am also not able to find examples (which I can comprehend) on the internet.

I’m not sure whether you care to preserve comments in your parser result, but if you do, you could implement your parser like this:

let lex p =
  p <* whitespace

let data_line =
  let some x = Some x in
  lift3 (fun x y z -> (x, y, z))
    (lex key)
    (lex value)
    (lex (option None (comment >>| some)))

let line =
  whitespace *>
  choice
    (* each alternative is parenthesized so the fun body stops before the ';' *)
    [ (lex comment >>| fun comment   -> `Comment comment)
    ; (data_line   >>| fun (k, v, c) -> `Data(k, v, c)) ]
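
Since the sub-parsers were skipped in the question, here is a self-contained sketch of the choice-based version that you can run as is. The definitions of `whitespace`, `key`, `value` and `comment`, as well as the `parsed_line` type alias, are my own assumptions (keys and values are runs of characters other than blanks, `;` and newlines), so swap in your real ones. It also assumes Angstrom 0.14 or later, where `parse_string` takes the `~consume` argument.

```ocaml
open Angstrom

(* Hypothetical sub-parsers: the question does not show the real ones. *)
let is_space = function ' ' | '\t' -> true | _ -> false
let is_word = function ' ' | '\t' | ';' | '\n' | '\r' -> false | _ -> true

let whitespace = skip_while is_space          (* zero or more blanks *)
let key = take_while1 is_word
let value = take_while1 is_word
let comment =
  char ';' *> take_while (function '\n' | '\r' -> false | _ -> true)

(* Closed result type, so that [line] has no leftover type variables. *)
type parsed_line =
  [ `Comment of string | `Data of string * string * string option ]

(* Run [p], then drop any trailing blanks. *)
let lex p = p <* whitespace

let data_line =
  lift3 (fun k v c -> (k, v, c))
    (lex key)
    (lex value)
    (lex (option None (comment >>| fun c -> Some c)))

let line : parsed_line t =
  whitespace *>
  choice
    [ (lex comment >>| fun c -> `Comment c)
    ; (data_line >>| fun (k, v, c) -> `Data (k, v, c)) ]

let () =
  [ "SomeKey  SomeValue  ;SomeComment"; "SomeKey  SomeValue  "; ";SomeComment" ]
  |> List.iter (fun s ->
       match parse_string ~consume:Consume.All line s with
       | Ok (`Comment c) -> Printf.printf "comment: %s\n" c
       | Ok (`Data (k, v, c)) ->
         Printf.printf "data: %s = %s (%s)\n" k v
           (Option.value c ~default:"no comment")
       | Error msg -> Printf.printf "error: %s\n" msg)
```

Note that, like the snippet above, this sketch expects a value after every key; a key-only line would need `value` to be made optional as well.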

To eliminate choice entirely, you could use a bit of lookahead:

let line =
  whitespace *>
  peek_char_fail
  >>= function
    | ';' -> lex comment >>| fun comment -> `Comment comment
    | _   -> data_line   >>| fun (k, v, c) -> `Data(k, v, c)
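
On the Angstrom.Buffered question: as far as I understand it, that interface only changes how input reaches the parser (incrementally, in chunks); it does not change how often alternatives are retried, so the restructuring above is what addresses the re-scanning. In case it is still useful, here is a rough sketch of driving the lookahead parser through `Buffered.parse`, `Buffered.feed` and `Buffered.state_to_result`. The sub-parsers, the `parsed_line` alias, the `file` parser and the `parse_chunks` helper are hypothetical names of mine, and the chunk boundaries are deliberately placed in the middle of tokens.

```ocaml
open Angstrom

(* Hypothetical sub-parsers: the question does not show the real ones. *)
let is_space = function ' ' | '\t' -> true | _ -> false
let is_word = function ' ' | '\t' | ';' | '\n' | '\r' -> false | _ -> true

let whitespace = skip_while is_space
let key = take_while1 is_word
let value = take_while1 is_word
let comment =
  char ';' *> take_while (function '\n' | '\r' -> false | _ -> true)

type parsed_line =
  [ `Comment of string | `Data of string * string * string option ]

let lex p = p <* whitespace

let data_line =
  lift3 (fun k v c -> (k, v, c))
    (lex key)
    (lex value)
    (lex (option None (comment >>| fun c -> Some c)))

(* One character of lookahead decides the branch, so nothing is re-scanned. *)
let line : parsed_line t =
  whitespace *> peek_char_fail >>= function
  | ';' -> lex comment >>| fun c -> `Comment c
  | _ -> data_line >>| fun (k, v, c) -> `Data (k, v, c)

(* Assumption: lines end with a newline or with the end of the input. *)
let file : parsed_line list t =
  many (line <* (end_of_line <|> end_of_input))

(* Feed the chunks one by one, then signal end of input. *)
let parse_chunks chunks =
  let state =
    List.fold_left
      (fun st chunk -> Buffered.feed st (`String chunk))
      (Buffered.parse file)
      chunks
  in
  Buffered.state_to_result (Buffered.feed state `Eof)

let result =
  parse_chunks [ "SomeKey  Some"; "Value  ;note\n;only a com"; "ment\n" ]
```

In a real program the chunks would come from reading the file in blocks. This sketch does not handle blank lines: `many` simply stops at the first line that `line` cannot parse.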

Hope this helps.
