Sanity check on grammar for tree language

Writing a parser for Pascal parse tree dump files, which can contain tree structures with multiple children like:

(node A
   (node B)
   (node C)
   (node D
      (node F) 
   )
)

(where B, C, and D are children of A, and F is a child of D). I’ve got a conflict-free grammar written but parsing child nodes just refuses to work, always erroring on the closing parenthesis of the child. For context, nodes can optionally have some data, and optionally have children. Here’s the core subset of my grammar:

%start main
%type <Vp_lifter.Parse_tree.parse_tree_node list> main node_list
%type <Vp_lifter.Parse_tree.parse_tree_node> node
%type <Vp_lifter.Parse_tree.parse_tree_node_type> node_type
%type <string> resultdef
%type <Vp_lifter.Parse_tree.flag list> flags flags_list
%type <(string * Vp_lifter.Parse_tree.pt_vtype) list> data_list
%type <(string * Vp_lifter.Parse_tree.pt_vtype)> data
%type <unit> eq_col

%%

main :
      EOF  { [] }
    | node { [$1] }

node_list : { [] }
    | node node_list { $1 :: $2 }

node :
    LEFT_PARENTHESIS 
        node_type COMMA
        resultdef COMMA
        POS EQUALS LEFT_PARENTHESIS NUMBER COMMA NUMBER RIGHT_PARENTHESIS COMMA
        LOC EQUALS IDENTIFIER COMMA
        EXPECTLOC EQUALS IDENTIFIER COMMA
        FLAGS EQUALS flags COMMA
        CMPLX EQUALS NUMBER
        data_list
        node_list
    RIGHT_PARENTHESIS  
    {
        { 
            pt_type = $2;
            resultdef = return_type_of_string $4;
            pos = ($9, $11);
            loc = loc_of_string $16;
            expectloc = loc_of_string $20;
            flags = $24;
            cmplx = $28;
            data = $29;
            children = $30;
        }
    }

data_list : { [] }
    | NIL data_list     { $2 }
    | data data_list    { $1 :: $2 }

eq_col : EQUALS { () } | COLON { () }

data :
      IDENTIFIER eq_col IDENTIFIER  { ($1, String $3) }
    | IDENTIFIER EQUALS NUMBER      { ($1, Integer $3) }

and some sample input:

(blockn, resultdef = $void = "untyped", pos = (3,6), loc = LOC_INVALID, expectloc = LOC_INVALID, flags = [], cmplx = 0
         (loadn, resultdef = LongWord = "DWord", pos = (7,5), loc = LOC_INVALID, expectloc = LOC_INVALID, flags = [nf_write], cmplx = 1
            nil
            symbol = X
         ) <-- 'unexpected token ")"' error here
)

Here, nil is a data, as well as symbol = X. Children always come after data in the dump files.

Just hoping to get a sanity check here. It seems like I’ve implemented optional children + data properly but I’ve been spinning my wheels for a while now.

Your best bet is to shrink your grammar while still being able to reproduce the issue (for example, get rid of all the stuff between LEFT_PARENTHESIS and data_list). The problem will then be easy to spot.

Cheers,
Nicolas

1 Like