Menhir's solution for handling double angle brackets differently in type and term contexts?

A common parsing problem in languages that use angle brackets for generics is that you can’t tokenize “>>” as the right-shift operator, because in a type context it would need to be parsed as two closing angle brackets.

However, when you tokenize “>>” as two angle brackets, then in expression context, when matching two “>”s to parse a right shift, you have to make sure the two “>”s are next to each other, so that you parse x >> y as expected without also parsing x > > y as a right shift.

I’m looking at existing LR(1)/LALR(1) parser generators to see if/how they handle this and I’m wondering if Menhir has a solution to this problem.

So far the only solution that I’m aware of that can be used in LR(1) and LALR(1) parser generators is that the parser can pass a set of expected token types to the lexer.

(I think this only works when you don’t have any LR(1) states that (1) expect both “>>” and “>”, and (2) can reduce to either a type or an expression, which probably holds for most grammars.)

However, I’m not aware of any LR(1)/LALR(1) parser generators that actually do this. tree-sitter does, but it’s not LR(1).

I’m curious if Menhir allows this. If not, how would I handle this case in Menhir?

If by “this” you mean passing a set of expected token types from the parser to the lexer, then I believe this is the kind of thing you could do using the “incremental” API Menhir provides.


Alternatively, I think you could implement the adjacency check (making sure the two “>”s are next to each other) using $startpos.

E.g. your parse rule for x >> y could look like

rshift_expr:
  | e1=expr RABRACK RABRACK e2=expr
  {
    let p1 = $startpos($2) in
    let p2 = $startpos($3) in
    (* the second ">" must start right after the first one *)
    if p1.pos_cnum + 1 <> p2.pos_cnum then raise ...
  }

I think this should be managed in the lexer, not in the parser. You should have two distinct tokens for >> and >, and the rules in Menhir would only describe how those tokens compose together.

ocamllex uses the longest-match rule by default and will be able to identify the two constructions without any effort:

If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.

You can’t handle this in the lexer without extra information from the parser (such as a set of expected tokens passed by the parser to the lexer), as the lexer has no way of knowing whether two “>”s need to be combined into one token or not.

The longest-match rule will always combine them, which is the wrong lexing in a type context when you have T1<T2<T3>>.

So far the only solution that I’m aware of that can be used in LR(1) and LALR(1) parser generators is that the parser can pass a set of expected token types to the lexer.

There is a different solution, which is not that uncommon. Instead of having a token for >>, you have a token for a > that is immediately followed by another >. Let us call RANGLE the normal > token, and RRANGLE the > token that is immediately followed by another >. In other words, the >> operator produces the sequence of tokens RRANGLE RANGLE. Then, the rule for expressions is

expr ::= expr RRANGLE RANGLE expr

while the rule for types is

type ::= ident LANGLE type (RANGLE | RRANGLE)

Notice how the grammar is properly LR(1).
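To make the scheme concrete, here is a minimal hand-rolled sketch in plain OCaml (not ocamllex); the token names match the ones above, and `OTHER` is a placeholder constructor of my own for everything else. A `>` is classified as RRANGLE exactly when the next character is another `>`:

```ocaml
(* Sketch only: classify each '>' by peeking at the next character. *)
type token = RANGLE | RRANGLE | OTHER of char

let tokenize (s : string) : token list =
  let n = String.length s in
  let rec go i acc =
    if i >= n then List.rev acc
    else match s.[i] with
      | ' ' -> go (i + 1) acc
      | '>' when i + 1 < n && s.[i + 1] = '>' -> go (i + 1) (RRANGLE :: acc)
      | '>' -> go (i + 1) (RANGLE :: acc)
      | c -> go (i + 1) (OTHER c :: acc)
  in
  go 0 []

let () =
  assert (tokenize ">>" = [RRANGLE; RANGLE]);
  assert (tokenize "> >" = [RANGLE; RANGLE])
```

So >> yields RRANGLE RANGLE while > > yields RANGLE RANGLE, and the two grammar rules above can distinguish them with ordinary LR(1) lookahead.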


As I explain in my original post, generating two tokens instead of one does not solve the issue by itself. The parser then needs to distinguish two RANGLEs next to each other from two RANGLEs with whitespace in between, to be able to parse x >> y as expected without also parsing x > > y as a right shift.

My question was whether Menhir allows checking whitespace between two tokens. If not, what would be the way in Menhir to handle this.

You did not understand the solution I proposed. The input >> is tokenized as RRANGLE RANGLE, while the input > > is tokenized as RANGLE RANGLE. Notice how the first token of both sequences is different.


Ah, that makes sense, thanks.

I’m curious if there are open-source parsers (using Menhir or another LR(1)/LALR(1) generator) that use this approach that I could have a look at?

You can compare the values of $startpos(...).pos_cnum between the two tokens, and if they differ by more than one, then whitespace was present.

Thanks. I think this would work in this particular case, but checking the whitespace in a semantic action means the parser has already decided to reduce a production, so we have to raise a parse error if there’s space between the two tokens. We can’t continue parsing.

(Or maybe Menhir allows continuing parsing from a semantic action code, without reducing?)

Ideally the parser should check the whitespace before deciding to reduce. I.e. if I have > > and it has shifted the first >, it should be in a state where it expects either another > immediately next to the current one, or one of the tokens that can start an expr (or others, depending on the grammar). This state would not accept a > with whitespace on its left as a valid shift token.

As I said, in this particular problem handling this in a semantic action is fine as > > in expression context is always a parse error (probably).


No idea. In Why3, we have something related, because we need (*) to be parsed as a sequence of three tokens, while (* would normally start a comment.

In your case, it is just a matter of adding the following line to your lexer (assuming it is an ocamllex-based lexer):

  | ">>" { backjump lexbuf 1; RRANGLE }

where backjump is defined as follows:

  let backjump lexbuf chars =
    if chars < 0 || chars > lexbuf.lex_curr_pos - lexbuf.lex_start_pos then
      invalid_arg "backjump";
    let pos = lexbuf.lex_curr_p in
    lexbuf.lex_curr_pos <- lexbuf.lex_curr_pos - chars;
    lexbuf.lex_curr_p <- { pos with pos_cnum = pos.pos_cnum - chars }

Then, as I wrote before, modify your parser by replacing RANGLE with (RRANGLE | RANGLE) wherever >> should be parsed as > >.
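If you’d rather not write the (RRANGLE | RANGLE) alternation at every use site, Menhir’s %inline definitions can factor it out; a sketch (the nonterminal name rangle is my own):

```
%inline rangle:
  | RANGLE  { () }
  | RRANGLE { () }
```

Since %inline definitions are expanded at each use site before the automaton is built, this is equivalent to writing the alternation inline.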


Maybe something like the C “lexer hack” can work here?

Parsing Ambiguity: Type Argument v. Less Than — Vladimir Keleshev

One way to do it is to maintain a state within the lexer, something like let is_type_context = ref false, and let the lexer look up this context and issue two tokens when >> is matched.

It’s probably possible to pass the context as an argument to the lexer rules.
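A minimal sketch of that stateful idea, using a one-token buffer so that >> can be delivered as two RANGLE tokens in type context (the names is_type_context, pending, and token_for_gtgt are all my own, and a real lexer would check the buffer before scanning any new input):

```ocaml
(* Sketch only: stateful "lexer hack" for the ">>" ambiguity. *)
type token = RANGLE | RSHIFT

let is_type_context = ref false

(* Holds the second RANGLE of a split ">>" until the next token request. *)
let pending : token Queue.t = Queue.create ()

(* Called when the lexer has just matched ">>". *)
let token_for_gtgt () =
  if not (Queue.is_empty pending) then Queue.pop pending
  else if !is_type_context then begin
    Queue.push RANGLE pending;  (* second '>' delivered on the next call *)
    RANGLE
  end
  else RSHIFT
```

In type context the first call returns RANGLE and buffers a second RANGLE; in expression context a single RSHIFT is emitted.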

The incremental API of Menhir allows you to store the state of the parser and feed it tokens. With this, you can restart parsing from a previous state with different tokens. You can also ask whether a given state will accept a specific token.

You can check out the js_of_ocaml JavaScript parser for examples:
