Parser combinators vs. parser preprocessors?

This question is elicited by the Transept post, and specifically by the linked-to Parsec paper on parser combinators. I’ve started reading hte paper, and have comparing-and-contrasting with the implementation of the Ocaml parser itself.

To wit, it seems like there are a number of places where it’s useful/important to have a preprocessor that takes a language that is most definitely isn’t made of parser-combinators, and does various syntactic transformations on it. The example that comes to mind is the heavy use of inlining, and specifically in order to both achieve modularity (of a sort) while preserving the ability for the grammar-compiler to detect and implement proper precedence rules. I’m betting there are other examples.

Also, I’m personally a massive LL(1) (over LALR) bigot, so it’s not that I’m persuaded by “yacc” or anything: it’s a general feeling that the preprocessing of a grammar by various algorithms/heuristics is valuable, and hence, that parser-combinators are not the way to go, to write language-processors.

Maybe I’m wrong. I’d love to be wrong. Those tools don’t fit well into the rest of our libraries and code, after all.

1 Like

There is a Parsec-inspired parser combinator in OCaml - Angstrom.

Combinators also describe a grammar; they can build a representation that is then processed. I think it would be perfectly reasonable to provide combinators to describe a L(AL)R grammar, and then a function from such a grammar to a parsing automaton, along with the result of various analyses. This would solve the “additional tooling” problem of typical parser generators, and also the “lack of conflict analysis” problem of typical parser combinator libraries. But it may require support for staging for performance reasons. Menhir takes a couple seconds to generate a parser from a big grammar, which is fine as a compile-time operation but I wouldn’t want to pay this cost every time I am starting an application containing the parser.

3 Likes

Readers of this thread may be interested in the asp (algebraic staged parsing) library (also described in the Transept post linked above), which is built on an approach along the lines @gasche describes:

  • combinators that describe a grammar (using context-free expressions)
  • an analysis (formulated as a type system) that ensures deterministic parsing
  • staging to eliminate performance overhead

The interface is pretty standard, with combinators for alternation, sequencing, etc., and performance is quite good (better than ocamlyacc on our benchmarks).

There’s a paper, A typed algebraic approach to parsing, that describes the design in more detail.

Grammars built using asp are essentially LL(1). (The weasel word “essentially” isn’t hiding much here, but the paper has the details.)

3 Likes