A common parsing problem in languages that use angle brackets for generics is that you can’t tokenize “>>” as the right shift operator as in type context it would need to be parsed as two closing angle brackets.
However when you tokenize “>>” as two angle brackets, in expression context, when matching two ">"s to parse a right shift operator, you have to make sure the two ">"s are next to each other so that you parse x >> y as expected without also parsing x > > y as a right shift.
I’m looking at existing LR(1)/LALR(1) parser generators to see if/how they handle this and I’m wondering if Menhir has a solution to this problem.
So far the only solution that I’m aware of that can be used in LR(1) and LALR(1) parser generators is that the parser can pass a set of expected token types to the lexer.
(I think this only works when you don’t have any LR(1) states that (1) expect both “>>” and “>” (2) can reduce to a type or expression, which probably holds for most grammars)
However I’m not aware of any parser generators that actually do this. tree-sitter does this, but it’s not LR(1).
I’m curious if Menhir allows this. If not, how would I handle this case in Menhir?
rshift_expr:
| e1=expr RABRACK RABRACK e2=expr
{
let p1 = $startpos($2) in
let p2 = $startpos($3) in
if p1.pos_cnum + 1 <> p2.pos_cnum then raise ...
}
I think this should managed in the lexer, not in the parser. You should have two distinct tokens for >> and > and the rules in menhir would only describe how thoses token compose together.
OCamllex use the rule of the longest match by default, and will be able do identify the two constructions without any effort:
If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.
You can’t handle this in the lexer without extra information from the parser (such as a set of expected tokens passed by the parser to the lexer) as the lexer has no way of knowing whether two ">"s need to be combined as one token or not.
Longest match rule will always combine, which is not the right lexing in type context when you have T1<T2<T3>>.
So far the only solution that I’m aware of that can be used in LR(1) and LALR(1) parser generators is that the parser can pass a set of expected token types to the lexer.
There is a different solution, which is not that uncommon. Instead of having a token for >>, you have a token for > when it is immediately followed by some other >. Let us call RANGLE the normal > token, and RRANGLE when it is followed. In other words, the >> operator produces the sequence of tokens RRANGLE RANGLE. Then, the rule for expressions is
As I explain in my original post, generating two tokens instead of one does not solve the issue by itself. The parser then needs to distinguish two RANGLEs next to each other from two RANGLEs with whitespace in between, to be able to parse x >> y as expected without also parsing x > > y as a right shift.
My question was whether Menhir allows checking whitespace between two tokens. If not, what would be the way in Menhir to handle this.
You did not understand the solution I proposed. The input >> is tokenized as RRANGLE RANGLE, while the input > > is tokenized as RANGLE RANGLE. Notice how the first token of both sequences is different.
Thanks. I think in this particular case I think this would work, but checking the whitespace in a semantic action means the parser already decided to reduce a production, so we have to raise a parse error if there’s space between the two tokens. We can’t continue parsing.
(Or maybe Menhir allows continuing parsing from a semantic action code, without reducing?)
Ideally the parser should check the whitespace before deciding to reduce. I.e. if I have > > and it shifted the first >, it should be in a state where it expects another > right next to the current one, or one of expr tokens (or others depending on the grammar). This state would not include > with whitespace on the left as a valid shift token.
As I said, in this particular problem handling this in a semantic action is fine as > > in expression context is always a parse error (probably).
One way to do it is to maintain a state within the lexer, something like let is_type_context = ref false, and let the lexer look up this context and issue two tokens when >> is matched.
The incremental API of menhir allows to store the state of the parser and feed it tokens. With this, you can restart parsing from previous state with different tokens. Also, you can also ask whether a given state will accept a specific token.
You can checkout js_of_ocaml javascript parser for examples: