My two favourite features of Menhir, when compared to other parser generators, are checkpoints (or Incremental Interface), and the ability to marshal the intermediate representations (*.cmly files), that is the grammar and the automata.
By the way Menhir is LR(1), not just LALR(1), so it can accept slightly more grammar and the conflicts are less ad hoc. Also, my two favourite features of LR are that parsing is linear and that is has the prefix property: the parser will signal an error on the first erroneous token. The consumed prefix always admits at least one continuation.
Checkpoints & Inspection
The traditional Yacc interface grabs control. You call the parser passing a lexer as arguments. The parser invokes the lexer as needed and either succeeds giving you the final semantic value, or fails without much information.
The incremental interface of Menhir (available in the table mode, by passing the --table flag) allows inverting this control flow : rather than calling a function to do the whole analysis, the parser continuation is reified as an OCaml value.
A set of functions is provided to operate on these continuations:
- create new parsers, each entry point of the grammar becoming a value of type
a checkpoint, where
a is the type of semantic values.
- advance a parser by repeatedly calling
offer that takes a checkpoint and a token and can:
- return a value of type
a if parsing finished
- return a continuation if more tokens are needed
- reject the entry if the token cannot be consumed from the checkpoint
The parser state is immutable, so as long as semantic actions have no side effect, you can reuse a previous parser to, say, explore multiple parses. The inspection interface optionally generated by Menhir (
--inspection) lets you look at the content of the parser check point (current state, current stack).
MenhirSdk & Cmly
--cmly flag, Menhir will produce a
parser.cmly file. It contains most intermediate structures, saved for further inspection. The
MenhirSdk package that allows loading these structures in an OCaml program to implement custom analyses, without having to touch Menhir implementation.
I used the feature above to solve many problems that commonly arise when parsing and for which a traditional LR parser falls short.
Error recovery in Merlin
Merlin has to produce an AST even in presence of errors (most of the time, some form of partial input).
By combining a static analysis using
--cmly (finding sequence of actions that allows exiting from all syntactic constructs) and a dynamic one using
--inspection (a flavour of Dijkstra’s shortest path that applies action on the parsing stack), Merlin can always produce an AST.
The prefix property guarantees that no information from the valid prefix is lost (this is relevant while typing because very often, the code before the cursor is correct).
The static analysis guarantee that the recovery is complete: it is always possible to recover (and points out problematic constructions if not).
The dynamic one generates short completions and tries to balance between finishing existing syntactic construction and consuming more input.
And since the analysis is mostly grammar agnostic, we have been able to use it for many OCaml and Reason versions without significant maintenance.
Parsing beyond LR
Sometimes one needs to parse languages that are not LR. Unfortunately, Menhir does not implement the GLR algorithm, so there is no support for these languages out-of-the-box.
But the parsing checkpoints let you implement backtracking manually. Yann Régis-Gianas describes such a use in Morbig. I also used that for parsing Reason and Hack languages.
It also makes it simple to express some “meta-syntactic” rules, such as “inserts
; whenever this is not ambiguous and would permit continuing the parse” (Reason does that).
The Syntactic completion PR extends Merlin completion to suggest grammatical constructions. The completions are derived automatically from the grammar, with some hints and heuristics to determine how they should be presented.
Sasano & Choi A text-based syntax completion method using LR parsing presented at PEPM21 nicely introduces this concept for LR grammar (although our algorithm is different). You can also look at the related Spoofax Completion Framework which is not based on LR).
Coming soon: declarative error message generation
I am currently working on a new framework for producing error messages using Menhir. It should come as an alternative to the existing error infrastructure, though we plan to find a way to benefit from both.
This is still a work in progress, but to give a taste of it, on
let x = 5 ;
let y = 6
let z = 7
our current prototype marks the
; as potentially problematic, because it prevents the
let x = 5 from being parsed as a separate top-level definition.
Once again, this was developed outside of Menhir, by reusing the CMLY and the checkpoints.