Menhir and preserving comments from source

I’ve used ocamlyacc over the years a lot, and menhir in a couple of projects (including a big one I’m working on right now). I’ve also used camlp4/camlp5’s stream-parsers in a ton of projects. And of course, with ocamllex and sedlexing. I find that with stream-parsers, it’s easy to arrange for preserving lexical positions in tokens, and then carrying that across to the parse-tree. To wit,

type basic_token = … ;;
type token = basic_token * lexical_position_info_t ;;

and then in your stream parser, you pattern-match on the first component, e.g.

parser [< … ; '(Tstring s, _) ; … >] -> yadda yadda

But with menhir (and ocamlyacc) it seems like, you need to embed the lexical position info in the token, e.g.

type basic_token =
| Tstring of lexicai_position_info_t * string
| Tsemi of lexical_position_info_t
etc

Is there some trick I’m missing, for how to use camlyacc/menhir in a manner that allows preserving this positional information during the parse?

1 Like

In my Orsetto project, which contains a functional parser combinator library among many other things, there is an abstraction for representing a parsed object decorated with its source text location span. It’s easy enough to work with explicitly that I haven’t felt too itched to write a PPX for sugaring the syntax for it like the camlp4 stream parsers do.

Yep, I get you: it’s really easy with stream-parsers, too, and I’ve done it a number of times. Thing is, the language for which I need this parser …
(1) had a YACC-parser for its version 1.5
(2) no longer has a YACC-parser starting at version 1.6 (today 1.12.x)
(3) has a (shall we say) “interesting” syntax. Some would call it “really approachable”; others would say “a hodgepodge of special cases that make people who know anything about parsing, cringe
(4) I don’t want to actually understand the grammar (which would be necessary to convert it from LALR(1) to LL(1), b/c again, “geez did you guys take meth when you were designing the syntax?”
(6) so I’m stuck using this YACC-grammar
(7) which, as it turns out, has all sorts of special-cases where it has to perform unholy acts with the lexer in order to parse the language (another reason converting to LL(1) would suuuuuuuck)

What’s this language? [I’d think that list of clues would suffice *grin*] Why, Golang.

Whatevs. Anyway, yeah, I wish I had an LL(1) grammar for the thing (say, in ANTLR, for instance). But I only have this YACC-grammar.

To have location/position information in the AST: the standard approach I’m familiar with is not to embed position information in the tokens, but to query it from the lexer or parser at the place where you build your AST values in the parser actions. When using ocamlyacc, I use the Lexing module for this (Lexing.lexeme_{start,end}_p), when using Menhir I use its special symbols ${start,end}pos, ${start,end}pos(n), $loc, $loc(n).

To preserve comments, an approach we use in the OCaml compiler (where comments that are docstrings are kept in the AST) is to have a global table of comments, that is filled by the Lexer, and accessed from parsing actions (there is a function that says basically “collect all the comments from the last time you were called to this position”).

2 Likes

Thank you for this pointer! It’s been so long ago I started using ocamlyacc, I’d forgotten (or never learned) that it had these special variables, and I use menhir basically by copying my old camlyacc files and making the minimal necessary changes (so I never fully read the Menhir manual). Sorry, I’ll do that now!

I know this doesn’t answer your question, but could you not use the parser that comes with Go to output the AST in a format that is easier to parse using Menhir?

See: https://golang.org/pkg/go/parser/

It’s an excellent question, and I considered it. Two problems:

(1) this idea ties me to the Golang infrastructure and feh/feh/feh I hate that fscking language (haha)

(2) Eventually I want to start making modifications to the grammar, in order to (um) support new language features. So I want to be starting with my own Golang parser.

A third one: (3) the hacky-ness of the grammar is somewhat paralleled by having to be hacky during pretty-printing. For example, there is no syntactic difference between a function-application and a type-cast. So “F(e)” could be either. And “*f(e)” could either be “apply f to e, then dereference the pointer” or a mis-parenthesized type-cast (of the type “*f” (pointer to type “f”), which should have been written “(*f)(e)”.

And being able to pretty-print back out code, so I can round-trip at various points, is a useful way of building tests. My Golang code-corpus is the source distribution, so it’s a lot of code, and that’s been really useful (e.g. to find undocumented nooks-and-crannies of the “intended specification” that never made it into the “online specification”). Running away from the grammar would be counterproductive to that.

Yeah: I know, “what a steaming pile; who could come up with such a crazy syntax?” So it goes.