Your favorite Menhir tricks and fanciness

[full disclosure: I’m not a Menhir fanboi, b/c prefer LL(1). But I still recognize that it’s quite advanced and well-developed. Lots of stuff there that people don’t know commonly, but should.]

After the thread about hand-written parsers, and the back-and-forth about LALR(1) parser-generators and their historical limitations, I thought it might be interesting to see what people find really cool in Menhir, and the cool ways they use it. Or even “stuff I saw people do with Menhir, that I was quite impressed by”. I’ve read here-and-there about Menhir’s grammar-conflict checking, and also its runtime support for error-detection and recovery, but don’t actually know much about it.

I’ll go first: Menhir has support for “macros” (the %inline directive, example at page 17 in the manual ). This is wicked stuff, and I am very impressed. Here’s the example from that page:

expression:
| i = INT { i }
| e = expression; o = op; f = expression { o e f }

%inline op:
| PLUS { ( + ) }
| TIMES { ( * ) }

The nonterminal op gets inlined into the nonterminal expression in the obvious way. Wicked.

2 Likes

I quite like the possibility to make rules that have parameters : this allows menhir to have a standard library, but you actually can make your own :

option(X):
| {None}
|  x=X{Some x}

They can also be inlined

1 Like

Menhir standard library! I never knew. Pages 17-18 of the manual, btw. Lovely!

Yes its very nice !
Beside the quite obvious use cases that are given by the standard library, I have seen that feature used to add locations to anything this way :

%public %inline located(X):
x=X { Position.with_poss $startpos $endpos x }

After that, in every rule you can use located(expr) if you want an expr with its position added. Obviously the ocaml code here may vary, this is why this is not part of the stdlib.

2 Likes

My two favourite features of Menhir, when compared to other parser generators, are checkpoints (or Incremental Interface), and the ability to marshal the intermediate representations (*.cmly files), that is the grammar and the automata.

By the way Menhir is LR(1), not just LALR(1), so it can accept slightly more grammar and the conflicts are less ad hoc. Also, my two favourite features of LR are that parsing is linear and that is has the prefix property: the parser will signal an error on the first erroneous token. The consumed prefix always admits at least one continuation.

Checkpoints & Inspection

The traditional Yacc interface grabs control. You call the parser passing a lexer as arguments. The parser invokes the lexer as needed and either succeeds giving you the final semantic value, or fails without much information.

The incremental interface of Menhir (available in the table mode, by passing the --table flag) allows inverting this control flow : rather than calling a function to do the whole analysis, the parser continuation is reified as an OCaml value.

A set of functions is provided to operate on these continuations:

  • create new parsers, each entry point of the grammar becoming a value of type a checkpoint, where a is the type of semantic values.
  • advance a parser by repeatedly calling offer that takes a checkpoint and a token and can:
    • return a value of type a if parsing finished
    • return a continuation if more tokens are needed
    • reject the entry if the token cannot be consumed from the checkpoint

The parser state is immutable, so as long as semantic actions have no side effect, you can reuse a previous parser to, say, explore multiple parses. The inspection interface optionally generated by Menhir (--inspection) lets you look at the content of the parser check point (current state, current stack).

MenhirSdk & Cmly

With the --cmly flag, Menhir will produce a parser.cmly file. It contains most intermediate structures, saved for further inspection. The MenhirSdk package that allows loading these structures in an OCaml program to implement custom analyses, without having to touch Menhir implementation.

Use-cases

I used the feature above to solve many problems that commonly arise when parsing and for which a traditional LR parser falls short.

Error recovery in Merlin

Merlin has to produce an AST even in presence of errors (most of the time, some form of partial input).

By combining a static analysis using --cmly (finding sequence of actions that allows exiting from all syntactic constructs) and a dynamic one using --inspection (a flavour of Dijkstra’s shortest path that applies action on the parsing stack), Merlin can always produce an AST.

The prefix property guarantees that no information from the valid prefix is lost (this is relevant while typing because very often, the code before the cursor is correct).

The static analysis guarantee that the recovery is complete: it is always possible to recover (and points out problematic constructions if not).

The dynamic one generates short completions and tries to balance between finishing existing syntactic construction and consuming more input.

And since the analysis is mostly grammar agnostic, we have been able to use it for many OCaml and Reason versions without significant maintenance.

Parsing beyond LR

Sometimes one needs to parse languages that are not LR. Unfortunately, Menhir does not implement the GLR algorithm, so there is no support for these languages out-of-the-box.

But the parsing checkpoints let you implement backtracking manually. Yann Régis-Gianas describes such a use in Morbig. I also used that for parsing Reason and Hack languages.

It also makes it simple to express some “meta-syntactic” rules, such as “inserts ; whenever this is not ambiguous and would permit continuing the parse” (Reason does that).

Syntactic completion

The Syntactic completion PR extends Merlin completion to suggest grammatical constructions. The completions are derived automatically from the grammar, with some hints and heuristics to determine how they should be presented.

Sasano & Choi A text-based syntax completion method using LR parsing presented at PEPM21 nicely introduces this concept for LR grammar (although our algorithm is different). You can also look at the related Spoofax Completion Framework which is not based on LR).

Coming soon: declarative error message generation

I am currently working on a new framework for producing error messages using Menhir. It should come as an alternative to the existing error infrastructure, though we plan to find a way to benefit from both.

This is still a work in progress, but to give a taste of it, on

let x = 5 ;
let y = 6
let z = 7

our current prototype marks the ; as potentially problematic, because it prevents the let x = 5 from being parsed as a separate top-level definition.

Once again, this was developed outside of Menhir, by reusing the CMLY and the checkpoints.

18 Likes

One of my favourite missing features from menhir is the ability to specify abstract precedence levels with a partial order between them, as provided by Dypgen’s %relation keyword.

2 Likes

I came to like the incremental API and the declarative error reporting that you can use if you switch to the incremental API. However, merging and updating error description files is still a bit of a headache, even though it has improved, and I’m sure will keep improving.

2 Likes

My favorite feature is naturally the conflict explanation, which only requires the --explain flag to trigger (and should be the default mode, really). Here is for instance one of the conflicts you obtain if you forget to set the priority of INT higher than function application in an ML-ish language.

** Conflict (shift/reduce) in state 19.
** Token involved: INT
** This state is reached from topexpr after reading:

FUN name RIGHTARROW simple_expr

** The derivations that appear below have the following common factor:
** (The question mark symbol (?) represents the spot where the derivations begin to differ.)

topexpr 
expr SEMISEMI 
(?)

** In state 19, looking ahead at INT, reducing production
** expr -> simple_expr
** is permitted because of the following sub-derivation:

simple_expr list_expr // lookahead token appears because list_expr can begin with INT
value // lookahead token is inherited
FUN name RIGHTARROW expr // lookahead token is inherited
                    simple_expr . 

** In state 19, looking ahead at INT, shifting is permitted
** because of the following sub-derivation:

simple_expr 
value 
FUN name RIGHTARROW expr 
                    simple_expr list_expr 
                                simple_expr 
                                value 
                                . INT 

With fairly limited know-how, you can tell that the parser doesn’t know what to do when confronted with the situation fun x -> <e> INT .Either it parses it like (fun x -> <e>) INT or fun x -> (<e> INT). Priority naturally solves that question.
The error messages are both exhaustive and concrete, expressed in terms of parsing rules, which is highly effective when debugging a grammar.

1 Like

I tend to prefer the explanations from Bison because they are less verbose and thus more readable. (Your mileage may vary.) But I agree with you that it is unfortunate that this is not the default behavior of this kind of tools.

a.y: warning: shift/reduce conflict on token NAME [-Wcounterexamples]
  Example: expr expr . NAME
  Shift derivation
    expr
    `-> expr expr
             `-> expr expr
                      `-> . NAME
  Reduce derivation
    expr
    `-> expr            expr
        `-> expr expr . `-> NAME
1 Like