Can we simplify collection of Menhir $startpos $endpos?

Hi everyone,

This is mostly a question to satisfy my curiosity, and not to fix any bugs.

I’m writing a compiler in OCaml and Menhir parser and am pleasantly happy with OCaml.

This is my parser file and I’m referring to an issue in this file: [nxn / lib / nxnParser.mly]

Using Menhir’s $startpos and $endpos values I was able to tag my AST nodes with location information, then use them to report errors in my code.

To help with this, I created these helper functions and they work well,

 let location (x: Lexing.position) =
    let lnum = x.pos_lnum in
    let cnum = x.pos_cnum - x.pos_bol in
    let loc = Ast.Loc{lnum=lnum; cnum=cnum} in
    loc
  ;;

  let position (startpos: Lexing.position) (endpos: Lexing.position) =
    let start = location startpos in
    let end' = location endpos in
    let pos = Ast.Pos{start=start; end'=end'} in
    pos
  ;;

My issue is, to use this position function, I need to explicitly pass $startpos and $endpos everywhere in my parser grammar, while calling the position function.

id:
  | i=IDVAL; { Ast.Id {value=i; pos=(position $startpos $endpos)} }

I don’t mind doing this, since it’s stable, and works well for my task.

But it would be nice if there was some way to simplify the pos=(position $startpos $endpos) into something simple like pos=p where p would contain the value returned from the position function.

My intention is to reduce the amount of boiler plate code required to write my parser.

Thanks

A typical approach is to change the type definition from

type expr =
  | Id of {value: string; pos: position}
  | ... of { ... ; pos: position }
  | ... of { ... ; pos: position }

into:

type expr = { desc: expr_desc; pos: position }
and expr_desc =
  | Id of {value: string}
  | ...
  | ...

One can then mirror this factoring in the parser with:

expr: expr_desc { {desc = $1; pos = position $startpos $endpos} }
;;
expr_desc:
| i=IDVAL { Ast.Id {value = i} }
| ...
;;

Makes sense?

Cheers,
Nicolas

1 Like

Another way that I would tackle this problem is by simply using Menhir’s $loc (a shorthand for ($startpos, $endpos)), and defining your AST similar to:

type position = Lexing.position * Lexing.position

type expr =
  | Id of {value: string; pos: position}
  | ... of { ... ; pos: position }
  | ... of { ... ; pos: position }
id:
  | i=IDVAL; { Ast.Id {value=i; pos=$loc} }

If you only need the position information for reporting errors, then you can save calling your position function for when you construct an error message. Another reason to save calling position for when you construct an error message is that you’ll need to do this anyway for errors raised by the parser itself.

More generally, beyond locations and error messages, I think it’s easiest when your parser mostly just constructs data with minimal extra logic, and any enhancements to it are done by the compiler’s later steps.

2 Likes

I really like your idea!

Thanks

 let location (x: Lexing.position) =
    let lnum = x.pos_lnum in
    let cnum = x.pos_cnum - x.pos_bol in
    let loc = Ast.Loc{lnum=lnum; cnum=cnum} in
    loc
  ;;

  let position (startpos: Lexing.position) (endpos: Lexing.position) =
    let start = location startpos in
    let end' = location endpos in
    let pos = Ast.Pos{start=start; end'=end'} in
    pos
  ;;

  let pos (loc: Lexing.position * Lexing.position) =
    let startpos, endpos = loc in
    position startpos endpos
  ;;

id:
    | i=IDVAL; { Ast.Id {value=i; pos=(pos $loc)} }

1 Like