Menhir's solution for handling double angle brackets differently in type and term contexts?

A common parsing problem in languages that use angle brackets for generics is that you can’t tokenize “>>” as the right-shift operator, because in a type context it would need to be parsed as two closing angle brackets.

However, when you tokenize “>>” as two angle brackets, then in expression context, when matching two “>”s to parse a right shift, you have to make sure the two “>”s are next to each other, so that you parse x >> y as expected without also parsing x > > y as a right shift.

I’m looking at existing LR(1)/LALR(1) parser generators to see if/how they handle this and I’m wondering if Menhir has a solution to this problem.

So far the only solution that I’m aware of that can be used in LR(1) and LALR(1) parser generators is that the parser can pass a set of expected token types to the lexer.

(I think this only works when you don’t have any LR(1) states that (1) expect both “>>” and “>”, and (2) can reduce to either a type or an expression, which probably holds for most grammars.)

However, I’m not aware of any LR(1)/LALR(1) parser generators that actually do this. tree-sitter does, but it’s not LR(1).

I’m curious if Menhir allows this. If not, how would I handle this case in Menhir?

If by “this” you mean passing a set of expected token types from the parser to the lexer, then I believe this is the kind of thing you could do using the “incremental” API Menhir provides.


Alternatively, I think you could implement the adjacency check (making sure the two “>”s are next to each other) using $startpos.

E.g. your parse rule for x >> y could look like

rshift_expr:
  | e1=expr RABRACK RABRACK e2=expr
  {
    let p1 = $startpos($2) in
    let p2 = $startpos($3) in
    (* the second ">" must start right after the first one *)
    if p1.pos_cnum + 1 <> p2.pos_cnum then raise ...
  }

I think this should be managed in the lexer, not in the parser. You should have two distinct tokens for >> and >, and the rules in Menhir would only describe how those tokens compose together.

ocamllex uses the longest-match rule by default and will be able to identify the two constructions without any effort:

If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.

You can’t handle this in the lexer without extra information from the parser (such as a set of expected tokens passed by the parser to the lexer), as the lexer has no way of knowing whether two “>”s need to be combined into one token or not.

The longest-match rule will always combine them, which is the wrong lexing in a type context when you have T1<T2<T3>>.

So far the only solution that I’m aware of that can be used in LR(1) and LALR(1) parser generators is that the parser can pass a set of expected token types to the lexer.

There is a different solution, which is not that uncommon. Instead of having a token for >>, you have a token for a > that is immediately followed by another >. Let us call RANGLE the normal > token, and RRANGLE the > token that is immediately followed by another >. In other words, the >> operator produces the sequence of tokens RRANGLE RANGLE. Then, the rule for expressions is

expr ::= expr RRANGLE RANGLE expr

while the rule for types is

type ::= ident LANGLE type (RANGLE | RRANGLE)

Notice how the grammar is properly LR(1).
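To make the scheme concrete, here is a minimal hand-rolled sketch in plain OCaml (not ocamllex); the token names match the ones above, and `OTHER` is a placeholder constructor of my own for everything else. A `>` is classified as RRANGLE exactly when the next character is another `>`:

```ocaml
(* Sketch only: classify each '>' by peeking at the next character. *)
type token = RANGLE | RRANGLE | OTHER of char

let tokenize (s : string) : token list =
  let n = String.length s in
  let rec go i acc =
    if i >= n then List.rev acc
    else match s.[i] with
      | ' ' -> go (i + 1) acc
      | '>' when i + 1 < n && s.[i + 1] = '>' -> go (i + 1) (RRANGLE :: acc)
      | '>' -> go (i + 1) (RANGLE :: acc)
      | c -> go (i + 1) (OTHER c :: acc)
  in
  go 0 []

let () =
  assert (tokenize ">>" = [RRANGLE; RANGLE]);
  assert (tokenize "> >" = [RANGLE; RANGLE])
```

So >> yields RRANGLE RANGLE while > > yields RANGLE RANGLE, and the two grammar rules above can distinguish them with ordinary LR(1) lookahead.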


As I explain in my original post, generating two tokens instead of one does not solve the issue by itself. The parser then needs to distinguish two RANGLEs next to each other from two RANGLEs with whitespace in between, to be able to parse x >> y as expected without also parsing x > > y as a right shift.

My question was whether Menhir allows checking whitespace between two tokens. If not, what would be the way in Menhir to handle this.

You did not understand the solution I proposed. The input >> is tokenized as RRANGLE RANGLE, while the input > > is tokenized as RANGLE RANGLE. Notice how the first token of both sequences is different.


Ah, that makes sense, thanks.

I’m curious if there are open-source parsers (using Menhir or another LR(1)/LALR(1) generator) that use this approach that I could have a look at?

You can compare the values of $startpos(...).pos_cnum between the two tokens, and if they differ by more than one, then whitespace was present.

Thanks. I think this would work in this particular case, but checking the whitespace in a semantic action means the parser has already decided to reduce a production, so we have to raise a parse error if there’s space between the two tokens. We can’t continue parsing.

(Or maybe Menhir allows continuing parsing from a semantic action code, without reducing?)

Ideally the parser should check the whitespace before deciding to reduce. I.e. if I have > > and it has shifted the first >, it should be in a state where it expects either another > immediately next to the current one, or one of the tokens that can start an expr (or others, depending on the grammar). This state would not accept a > with whitespace on its left as a valid shift token.

As I said, in this particular problem handling this in a semantic action is fine as > > in expression context is always a parse error (probably).


No idea. In Why3, we have something related, because we need (*) to be parsed as a sequence of three tokens, while (* would normally start a comment.

In your case, it is just a matter of adding the following line to your lexer (assuming it is an ocamllex-based lexer):

  | ">>" { backjump lexbuf 1; RRANGLE }

where backjump is defined as follows:

  let backjump lexbuf chars =
    if chars < 0 || chars > lexbuf.lex_curr_pos - lexbuf.lex_start_pos then
      invalid_arg "backjump";
    let pos = lexbuf.lex_curr_p in
    lexbuf.lex_curr_pos <- lexbuf.lex_curr_pos - chars;
    lexbuf.lex_curr_p <- { pos with pos_cnum = pos.pos_cnum - chars }

Then, as I wrote before, modify your parser by replacing RANGLE with (RRANGLE | RANGLE) wherever >> should be parsed as > >.
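If you’d rather not write the (RRANGLE | RANGLE) alternation at every use site, Menhir’s %inline definitions can factor it out; a sketch (the nonterminal name rangle is my own):

```
%inline rangle:
  | RANGLE  { () }
  | RRANGLE { () }
```

Since %inline definitions are expanded at each use site before the automaton is built, this is equivalent to writing the alternation inline.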


Maybe something like the C “lexer hack” can work here?

Parsing Ambiguity: Type Argument v. Less Than — Vladimir Keleshev

One way to do it is to maintain a state within the lexer, something like let is_type_context = ref false, and let the lexer look up this context and issue two tokens when >> is matched.

It’s probably possible to pass the context as an argument to the lexer rules.
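A minimal sketch of that stateful idea, using a one-token buffer so that >> can be delivered as two RANGLE tokens in type context (the names is_type_context, pending, and token_for_gtgt are all my own, and a real lexer would check the buffer before scanning any new input):

```ocaml
(* Sketch only: stateful "lexer hack" for the ">>" ambiguity. *)
type token = RANGLE | RSHIFT

let is_type_context = ref false

(* Holds the second RANGLE of a split ">>" until the next token request. *)
let pending : token Queue.t = Queue.create ()

(* Called when the lexer has just matched ">>". *)
let token_for_gtgt () =
  if not (Queue.is_empty pending) then Queue.pop pending
  else if !is_type_context then begin
    Queue.push RANGLE pending;  (* second '>' delivered on the next call *)
    RANGLE
  end
  else RSHIFT
```

In type context the first call returns RANGLE and buffers a second RANGLE; in expression context a single RSHIFT is emitted.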

The incremental API of Menhir allows you to store the state of the parser and feed it tokens. With this, you can restart parsing from a previous state with different tokens. You can also ask whether a given state will accept a specific token.

You can check out the js_of_ocaml JavaScript parser for examples:
