Optional trailing commas (or other symbols) are quite a common feature in various programming languages. For example, Rust allows you to add a trailing comma at the end of an array like this ( Array and index expressions - The Rust Reference (rust-lang.org)):
I was working on supporting such a feature in a parser application but ran into problems regarding shift/reduce conflicts.
The first approach I tried is this:
%{
%}
%token LBRACK RBRACK COMMA
%token <int> VAR
%start<int list> bracket_list
%%
bracket_list:
LBRACK l = separated_list(COMMA, VAR) option(COMMA) RBRACK { l }
And I got conflict warnings:
** Conflict (shift/reduce) in state 2.
** Token involved: COMMA
** This state is reached from bracket_list after reading:
LBRACK VAR
** The derivations that appear below have the following common factor:
** (The question mark symbol (?) represents the spot where the derivations begin to differ.)
bracket_list
(?)
** In state 2, looking ahead at COMMA, reducing production
** separated_nonempty_list(COMMA,VAR) -> VAR
** is permitted because of the following sub-derivation:
LBRACK loption(separated_nonempty_list(COMMA,VAR)) option(COMMA) RBRACK // lookahead token appears because option(COMMA) can begin with COMMA
separated_nonempty_list(COMMA,VAR) // lookahead token is inherited
VAR .
** In state 2, looking ahead at COMMA, shifting is permitted
** because of the following sub-derivation:
LBRACK loption(separated_nonempty_list(COMMA,VAR)) option(COMMA) RBRACK
separated_nonempty_list(COMMA,VAR)
VAR . COMMA separated_nonempty_list(COMMA,VAR)
My understanding is that the parser got confused because with a token VAR
, it can reduce right in place, ignoring the option(COMMA)
at the end. But with the token COMMA
it can also shift to the next part of option(COMMA)
.
I then tried to specify the precedence of the token COMMA
and the production so that shifting would be prioritized:
%{
%}
%token LOW_PREC
%nonassoc LOW_PREC
%token LBRACK RBRACK COMMA
%token <int> VAR
%nonassoc COMMA // higher precedence for the comma
%start<int list> bracket_list
%%
bracket_list:
LBRACK l = separated_list(COMMA, VAR) RBRACK { l } %prec LOW_PREC // lower precedence on this branch
| LBRACK l = separated_list(COMMA, VAR) COMMA RBRACK{ l }
But I got similar error messages. In the end, I tried to create my own version of separated_list
with support for trailing commas.
bracket_list:
LBRACK l = separated_nonempty_list_option_trailing(COMMA, VAR) RBRACK { l }
separated_nonempty_list_option_trailing(separator, X):
| x = X
{ [x] }
| x = X separator xs = separated_nonempty_list(separator, X);
{ [x] @ xs }
| x = X separator { [x] }
This time, the error is finally resolved.
Here are my questions:
- Why did the second approach with the specified precedence not work? Is it because the parser cannot determine whether the token
COMMA
belongs tooption(COMMA)
or the separated list? - I personally prefer to use more functions/macros from the standard library, so I wonder if there is any way to support trailing commas with
option
andseparated_list
. - Why didn’t my third approach run into a problem? It seems to me that the parser will also get confused after reading in the token
X
and pointing at the tokenseparator
since it does not know whether to reduce right in place according to the first production (x=X
), or shift to read in tokenseparator
so that the third production could be used (x=X separator
).