Solving shift/reduce conflicts for optional trailing comma in menhir

Optional trailing commas (or other symbols) are quite a common feature in various programming languages. For example, Rust allows you to add a trailing comma at the end of an array like this ( Array and index expressions - The Rust Reference (rust-lang.org)):

I was working on supporting such a feature in a parser application but ran into problems regarding shift/reduce conflicts.

The first approach I tried is this:

%{

%}
%token LBRACK RBRACK COMMA 
%token <int> VAR

%start<int list> bracket_list
%%

bracket_list:
     LBRACK l = separated_list(COMMA, VAR) option(COMMA) RBRACK { l } 

And I got conflict warnings:

** Conflict (shift/reduce) in state 2.
** Token involved: COMMA
** This state is reached from bracket_list after reading:

LBRACK VAR

** The derivations that appear below have the following common factor:
** (The question mark symbol (?) represents the spot where the derivations begin to differ.)

bracket_list 
(?)

** In state 2, looking ahead at COMMA, reducing production
** separated_nonempty_list(COMMA,VAR) -> VAR
** is permitted because of the following sub-derivation:

LBRACK loption(separated_nonempty_list(COMMA,VAR)) option(COMMA) RBRACK // lookahead token appears because option(COMMA) can begin with COMMA
       separated_nonempty_list(COMMA,VAR) // lookahead token is inherited
       VAR . 

** In state 2, looking ahead at COMMA, shifting is permitted
** because of the following sub-derivation:

LBRACK loption(separated_nonempty_list(COMMA,VAR)) option(COMMA) RBRACK 
       separated_nonempty_list(COMMA,VAR) 
       VAR . COMMA separated_nonempty_list(COMMA,VAR) 

My understanding is that the parser got confused because with a token VAR, it can reduce right in place, ignoring the option(COMMA) at the end. But with the token COMMA it can also shift to the next part of option(COMMA).

I then tried to specify the precedence of the token COMMA and the production so that shifting would be prioritized:

%{

%}

%token LOW_PREC
%nonassoc LOW_PREC
%token LBRACK RBRACK COMMA 
%token <int> VAR
%nonassoc COMMA // higher precedence for the comma


%start<int list> bracket_list
%%

bracket_list:
    LBRACK l = separated_list(COMMA, VAR) RBRACK { l } %prec LOW_PREC // lower precedence on this branch
    | LBRACK l = separated_list(COMMA, VAR) COMMA RBRACK{ l } 

But I got similar error messages. In the end, I tried to create my own version of separated_list with support for trailing commas.

bracket_list:
    LBRACK l = separated_nonempty_list_option_trailing(COMMA, VAR) RBRACK { l }

separated_nonempty_list_option_trailing(separator, X):
|  x = X
    { [x] }
| x = X separator xs = separated_nonempty_list(separator, X);
    { [x] @ xs }
| x = X separator { [x] }

This time, the error is finally resolved.


Here are my questions:

  1. Why did the second approach with the specified precedence not work? Is it because the parser cannot determine whether the token COMMA belongs to option(COMMA) or the separated list?
  2. I personally prefer to use more functions/macros from the standard library, so I wonder if there is any way to support trailing commas with option and separated_list.
  3. Why didn’t my third approach run into a problem? It seems to me that the parser will also get confused after reading in the token X and pointing at the token separator since it does not know whether to reduce right in place according to the first production (x=X), or shift to read in token separator so that the third production could be used (x=X separator).
2 Likes

The answer is kind of the same for all three questions. When the parser is inside the rule generated for separated_list(COMMA, VAR), as invoked from bracket_list, and it encounters a COMMA, it has two choices. Either it keeps processing the separated_list, or it leaves the rule and goes back to processing bracket_list. Unfortunately, there is no way the parser can take a meaningful decision without reading the next token (either VAR or RBRACK), which the parser is unable to. Setting a precedence does not change anything to the fact that it would still need to read one more token. But with your definition of separated_nonempty_list_option_trailing, there is no need to take a decision: When the parser encounters a COMMA, it keeps processing the rule.

1 Like

Does it work if you replace option by ioption ?

1 Like

@zapashcanon Unfortunately, this didn’t work. I’ve noticed that some conflicts could be resolved by replacing option with ioption, but I don’t really know the underlying mechanism. Can you please explain it?

I found an example regarding inlining from the documentation; I can sort of understand this example, but not really for option.

Instead of having separate outer and inner rules, inlining makes a single unified rule with many more productions. If you were to use ioption in your rule bracket_list, the rule would now have two productions: one with COMMA and one with nothing. But this cannot solve your original issue, since the parser still has to decide when to leave the rule separated_list, which it cannot do after reading only COMMA.