Is it feasible to write parsers without using polymorphic variants for AST representation?

I’ve been playing with my parser and trying to formalize its result better by using types which are not polymorphic variants… though it seems that as hard as I try, it’s impossible (or very ugly) to do it without going through polymorphic variants at one stage of the parsing process or another.

Also, I don’t seem to be gaining any performance by using closed variants, which is not what I expected.

For example, consider this parser which parses a number:

A version with polymorphic variants:

open Angstrom
let is_digit = function
  | '0' .. '9' -> true
  | _ -> false

let integer =
  take_while1 is_digit
  >>= fun s ->
  return (`NumericPart s)

let number =
  list [
    (peek_char >>= function
    | Some '-' -> advance 1 >>= fun _ -> return (`Negative)
    | Some '+' -> advance 1 >>= fun _ -> return (`Positive)
    | _ -> return (`Positive));
    integer <?> "Number expected";
    peek_char >>= function
      | None -> return (`NumericPart "0")
      | Some '.' -> (char '.' *> integer)
      | Some _ -> return (`NumericPart "0")
  ]
  >>= function
  | [sign; `NumericPart whole; `NumericPart part] ->
    let signed_whole = match sign with
      | `Negative -> "-" ^ whole
      | `Positive -> whole
      | _ -> assert false
    in
    let len_part = (String.length part) in
    let denom = (Int.pow 10 len_part) in
    return (`Fraction (int_of_string (signed_whole ^ part) , denom))
  | _ -> fail "Number parser shouldn't happen"

And here it is with closed variants:

open Angstrom

type sign =
  | Positive
  | Negative

type num_parts =
  | Sign of sign
  | NumericPart of string

let is_digit = function
  | '0' .. '9' -> true
  | _ -> false

let integer =
  take_while1 is_digit
  >>= fun s ->
  return (NumericPart s)

let number =
  list [
    (peek_char >>= function
    | Some '-' -> advance 1 >>= fun _ -> return (Sign Negative)
    | Some '+' -> advance 1 >>= fun _ -> return (Sign Positive)
    | _ -> return (Sign Positive));
    integer <?> "Number expected";
    peek_char >>= function
      | None -> return (NumericPart "0")
      | Some '.' -> (char '.' *> integer)
      | Some _ -> return (NumericPart "0")
  ]
  >>= function
  | [sign; NumericPart whole; NumericPart part] ->
    let signed_whole = match sign with
      | Sign Negative -> "-" ^ whole
      | Sign Positive -> whole
      | _ -> assert false
    in
    let len_part = (String.length part) in
    let denom = (Int.pow 10 len_part) in
    return (`Fraction (int_of_string (signed_whole ^ part) , denom))
  | _ -> fail "Number parser shouldn't happen"

So because I have to put heterogeneous things in nested lists (the AST), it becomes very tedious and explicit (hence ugly) when I try to use regular variants. It also doesn’t seem to buy me a lot of type safety, as I still regularly find myself writing catch-all cases which can’t happen in practice.

Also, to my surprise, my program doesn’t really run faster with closed variants. I tested on huge files and there isn’t really any difference in speed.

So I’m not sure what my question is here. But I’d be happy to discuss ways of improving this situation if there are any at all.

Your post actually contains the answer (a partial one, at least), as I can easily understand the second snippet of code, while the first one, which uses polymorphic variants, takes some effort to grasp. The reason for this is that types form a terse and easy-to-understand high-level specification language, which makes reading and understanding code much easier. As Fred Brooks said, “Show me your [code] and conceal your [data structures], and I shall continue to be mystified. Show me your [data structures], and I won’t usually need your [code]; it’ll be obvious.” Thus the more types you use, the more readable your code is¹. It may involve extra typing (pun intended), but as a nice bonus you get one guy who will faithfully read all your annotations and even prove that they are correct: your friend, the type checker.

Now the full answer. While the second code snippet is indeed more readable, that doesn’t really mean this is a disadvantage of polymorphic variants. You can (and should) use types and type annotations with polymorphic variants. Moreover, polymorphic variants provide a more advanced typing discipline, so you can actually express more complex ideas with them. On the other hand, if you can express the same idea with regular variants, then you shouldn’t invoke the more powerful tool, per Occam’s razor.
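
For illustration, annotating polymorphic variants with closed aliases could look like this (the names here are mine, not taken from your parser):

(* a closed alias over polymorphic variant tags; annotating with it
   restores most of the documentation value of a nominal type *)
type sign = [ `Positive | `Negative ]

let flip : sign -> sign = function
  | `Positive -> `Negative
  | `Negative -> `Positive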

The advanced typing discipline that I’m referring to is subtyping, a mechanism highly overrated and heavily abused in the computer industry, though still useful in many cases. While regular variants provide some notion of subtyping with a fixed and rigid hierarchy (that basically contains only one level of inheritance), polymorphic variants allow us to build arbitrary data hierarchies with the [>] object at the top. Thus polymorphic variants facilitate code reuse, though the latter often implies bug reuse, due to the stretching of the code over too many places.
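
To sketch what such a hierarchy can look like, here is a toy expression language (purely illustrative, not tied to your number parser):

type const = [ `Int of int ]
type expr = [ const | `Add of expr * expr ]

let rec eval : expr -> int = function
  | `Int n -> n
  | `Add (x, y) -> eval x + eval y

(* a value of the narrower type is usable at the wider one via a coercion *)
let one = eval ((`Int 1 : const) :> expr)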

Moreover, the arsenal of tools that we can use for building parsers is actually much bigger than the choice between two sorts of variants: we also have extensible variants and GADTs. Both are extensions of regular variants, and together they provide a very advanced typing discipline. If you dare to employ all four of them, you will be able to specify very complex behaviors, which may make your code more readable and safer. However, oftentimes an amalgamation of those four variants yields code that is much harder to understand. Again, there is no silver bullet, and bringing too powerful a weapon will not always give you an advantage.
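
For reference, here is a tiny taste of the other two, again on toy types of my own invention:

(* an extensible variant: new AST nodes may be added later,
   even from other modules *)
type ast = ..
type ast += Num of int
type ast += Neg of ast

(* a GADT: the type index tracks what each production evaluates to *)
type _ expr =
  | Int : int -> int expr
  | Bool : bool -> bool expr
  | If : bool expr * 'a expr * 'a expr -> 'a expr

let rec eval : type a. a expr -> a = function
  | Int n -> n
  | Bool b -> b
  | If (c, t, e) -> if eval c then eval t else eval e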

Finally, there is a fifth option. You can use the Tagless Final Style, as proposed and thoroughly explained by Oleg Kiselyov. Roughly, the idea is that instead of hard-coding a particular data representation (and making the hard choice between different genera of variants), you just parametrize your code with a module that provides a function for each production. This makes your code really generic, as your parser/interpreter now deals only with the semantics of the language, not with the representation. Since this approach uses functors, it invokes an even more powerful mechanism that allows you to express even more complex ideas, while at the same time facilitating code reuse and gradual refinement of your system.
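
A minimal sketch of the idea, with a made-up signature for your number grammar (one function per production; all the names here are illustrative):

module type Number_sem = sig
  type t
  val number : negative:bool -> whole:string -> part:string option -> t
end

(* one possible interpretation: evaluate a number to a fraction *)
module Fraction : Number_sem with type t = int * int = struct
  type t = int * int
  let rec pow10 n = if n <= 0 then 1 else 10 * pow10 (n - 1)
  let number ~negative ~whole ~part =
    let part = match part with None -> "" | Some p -> p in
    let num = int_of_string (whole ^ part) in
    ((if negative then -num else num), pow10 (String.length part))
end

A pretty-printer, or an interpretation that evaluates to a float, would be just another module of the same signature, with the parser itself left untouched.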

With all that said, I would also like to note that there are a few improvements you can apply to your code that may make it more readable and typeable, regardless of which variant of variants you decide to choose. First of all, you’re using a monomorphic list in a place where you could easily use a tuple, or even better, a record. I would write my type specification as follows

type sign = Pos | Neg
type number = {sign : sign option; whole : string; part : string}

I would also try to unpack as much of the structure in my parser as viable, e.g., instead of writing

 | [sign; NumericPart whole; NumericPart part] ->
    let signed_whole = match sign with
      | Sign Negative -> "-" ^ whole
      | Sign Positive -> whole
      | _ -> assert false
    in

I would write

| [Sign sign; NumericPart whole; NumericPart part] ->
  let sign = if sign = Negative then "-" else "" in
  let denom = Int.pow 10 (String.length part) in
  let num = int_of_string (sign ^ whole ^ part) in
  return (`Fraction (num, denom))

Finally, you’re littering your code with too many parentheses in places where they are not needed, which hampers readability.


1) Of course, nowadays, when we have tools like Merlin that are nicely integrated into Emacs and VS Code, the explicit presence of types is not strictly necessary, as you may interactively query your IDE for types. However, an IDE is not always available, cf. code review on GitHub, reading code without compiling it, etc.


Thank you for your thorough answer!
I think I was trying to reinvent something like the Tagless Final Style without being aware of it. I’ll go read further on the topic.

That’s because I only have the list parser combinator from Angstrom, not a tuple version. I think I understand the benefits, but it’s not available. Or maybe there is some other way to combine the number parts, but I haven’t figured out how to do that without using Angstrom.list.

Of course, that’s obvious to me, but by that point in time I had tried lots of things, so the code is certainly not “final”.

Newbie errors: I still can’t develop a good feel for precedence… so I still put in too many parentheses…

I see, so this is the root of the problem. You shall use the list combinator only for combining parsers that produce values of the same type. And when you have values of different types, then instead of trying to unify them into one type and upcast them, only to find yourself needing to downcast them in the next step, you shall instead use combinators that preserve your types.

In general, parser combinators like Angstrom are used in a manner where you define lots of small rules and then combine them, e.g.,

open Angstrom

type sign = Pos | Neg
type number = {sign : sign; whole : string; part : string option}

(* here integer yields the raw digits as a string *)
let integer = take_while1 (function '0' .. '9' -> true | _ -> false)

let sign = peek_char >>= function
  | Some '+' -> advance 1 >>| fun () -> Pos
  | Some '-' -> advance 1 >>| fun () -> Neg
  | _ -> return Pos

let dot = peek_char >>= function
  | Some '.' -> advance 1 >>| fun () -> true
  | _ -> return false

let number =
  sign >>= fun sign ->
  integer >>= fun whole ->
  dot >>= fun dot ->
  if dot
  then integer >>| fun part -> {sign; whole; part = Some part}
  else return {sign; whole; part = None}
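
For completeness, running it could look like this (assuming a recent Angstrom, where parse_string takes a ~consume argument):

let () =
  match parse_string ~consume:Consume.All number "-3.14" with
  | Ok {sign = Neg; whole = "3"; part = Some "14"} -> print_endline "parsed -3.14"
  | Ok _ -> print_endline "parsed some other number"
  | Error msg -> prerr_endline msg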

Now you don’t need to lose the types of your values, so you don’t need to use variants or any kind of subtyping. Again, subtyping is a much-abused mechanism.


Thanks, this made lots of the choices I made (and problems I had) obsolete, and allowed me to figure out the purpose of the >>| combinator.

Just to answer the performance aspect: in general you won’t see much of an improvement from choosing regular variants over polymorphic ones. Not only is one level of indirection not going to make much of a difference, but the strategy used by the compiler for regular variants is actually less efficient nowadays than the one used for polymorphic variants, so the final result is often about even.

Polymorphic variants use numbers that are hashes of their tags, so they are ‘random’, while regular variants use a contiguous range of numbers depending on how many constructors you’ve defined, e.g. 0 to 10. The compiler attempts to optimize the handling of the latter by using arrays, while the former must be handled with direct comparisons and jumps. Unfortunately, on modern processors, branch prediction makes the direct-comparison strategy far more efficient. This optimization needs to be revisited at some point, but until then, you won’t see a big performance difference.
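
If you are curious, you can peek at the two representations yourself; here is an unsafe probe using Obj, purely for illustration (never do this in real code):

type sign = Positive | Negative

let () =
  (* constant regular constructors are consecutive small integers *)
  Printf.printf "Positive = %d, Negative = %d\n"
    (Obj.magic Positive : int) (Obj.magic Negative : int);
  (* constant polymorphic variants are hashes of the tag name *)
  Printf.printf "`Positive = %d, `Negative = %d\n"
    (Obj.magic `Positive : int) (Obj.magic `Negative : int)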


The performance discussion seems a bit absurd to me. If you really care about performance, you should use a parser generator such as menhir, not a parser combinator library.


At first I tried that, and I used your ocaml_parsing skeleton… but then I got lost (I don’t even remember what went wrong) and tried Angstrom, and kind of succeeded in doing what I needed.

Now I’m giving it another go, and because I’m a lot more comfortable with OCaml, I should probably try writing this with menhir one more time.

I was addressing the expected delta.

Additionally, Angstrom is supposed to be quite fast. So long as you don’t perform allocations, it’s not immediately clear to me that a generator approach will be superior to a combinator-based one performance-wise.

One comparison I do have: a friend of mine wrote a regex-based parser (although in JavaScript) for the same format, which despite being a lot more naive was only about 10–20% faster than my Angstrom-based parser. I’m itching to try Menhir, but I too doubt that it will really be worth it.

You’re compiling to native code? Also, with flambda?

Native, yes. Though I don’t know what flambda is. Will have a look, thanks!

JavaScript and regexes are both not great tools for writing efficient parsers. It depends on the complexity of your grammar, but I would be very surprised if menhir couldn’t do significantly better.

1 Like

OK, so I should probably try it indeed.

OCaml doesn’t compile pattern matching using arrays; instead it uses decision trees that are directly encoded in assembly, so only comparisons and jumps are used. See the papers Optimizing Pattern Matching and Compiling Pattern Matching to Good Decision Trees, as well as the Matching and Switch modules in ocaml/bytecomp.

Thus, thanks to the total order on regular variant tags, they are always compiled to more efficient code, i.e., O(log N) instead of O(N). However, in the case of this particular parser the difference is negligible next to the overhead of the Angstrom state monad and its excessive allocations. Moreover, the difference really matters only when the number of variants is big; obviously, if we have only two branches, the produced code will not differ a lot in terms of performance.
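
To make the complexity claim concrete, here is roughly what the two dispatch strategies look like when written out by hand over integer tags (a sketch of the idea, not actual compiler output; the hash values are made up):

(* regular variants: tags are 0..3 in a known total order,
   so the decision tree can bisect, i.e., O(log N) comparisons *)
let dispatch_ordered tag =
  if tag < 2 then (if tag = 0 then "A" else "B")
  else if tag = 2 then "C" else "D"

(* polymorphic variants: tags are arbitrary hashes, so in the worst
   case they are tested one after another, i.e., O(N) comparisons *)
let dispatch_hashed tag =
  if tag = 65 then "A"
  else if tag = 914325 then "B"
  else if tag = 3305861 then "C"
  else "D"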


I would also like to back up @smolkaj’s opinion with an anecdote from my practice. I once wrote the same program in several different languages for parsing, in real time, the information about all aircraft roaming around the Earth. There are quite a few of them, by the way. The fastest parser combinator library that I used was C++’s Boost.Spirit. However, a parser written using the venerable ocamlyacc was faster than Spirit by several orders of magnitude. The moral is that Menhir and ocamlyacc will always be faster than any manually written parser, even if you use the fastest language without GC and allocations. However, parser combinators give you extra flexibility and allow you to parse more languages (including context-dependent languages), and sometimes they are easier to use (or are the only choice when the author of the language that you’re trying to parse has never heard of Chomsky and his hierarchy). But of course, this extra flexibility comes with a price.


Also post your code on GitHub for analysis.

[quote=“ivg, post:15, topic:1906”]
See the Optimizing Pattern Matching and Compiling Pattern Matching to Good Decision Trees
[/quote]

Thanks for the pointers to the papers! I seem to remember that regular variants were accelerated with array lookups from a previous discussion, but maybe I’m remembering incorrectly.

Yeah, that’s true. I guess Angstrom tries to do the best it can under the constraints of a state monad. I seem to remember state monads having disastrous performance implications in Haskell, even today. I wonder what flambda can contribute.

That’s really interesting. Good to know.