Menhir and preserving comments from source

Chet_Murthy · April 19, 2019, 7:06pm

I’ve used ocamlyacc a lot over the years, and menhir in a couple of projects (including a big one I’m working on right now). I’ve also used camlp4/camlp5’s stream-parsers in a ton of projects, along with ocamllex and sedlexing, of course. I find that with stream-parsers, it’s easy to arrange for preserving lexical positions in tokens, and then carrying that across to the parse-tree. To wit,
…
type basic_token = … ;;
type token = basic_token * lexical_position_info_t ;;
…

and then in your stream parser, you pattern-match on the first component, e.g.
…
parser [< … ; '(Tstring s, _) ; … >] -> yadda yadda
…
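
For readers without camlp4/camlp5 at hand, the same idea can be sketched in plain OCaml by matching on a list of (token, position) pairs. This is only an illustration: the `pos` record, the `stmt` type, and `parse_stmts` are hypothetical names, and a list stands in for the stream.

```ocaml
(* Hypothetical position and token types, for illustration only. *)
type pos = { line : int; col : int }
type basic_token = Tstring of string | Tsemi
type token = basic_token * pos

(* A tiny parse-tree node that keeps the position of its string literal. *)
type stmt = { lit : string; at : pos }

(* Parse a sequence of string-then-semicolon statements. The position
   rides alongside each token and is ignored in patterns unless needed. *)
let rec parse_stmts (toks : token list) : stmt list =
  match toks with
  | [] -> []
  | (Tstring s, p) :: (Tsemi, _) :: rest ->
    { lit = s; at = p } :: parse_stmts rest
  | (_, p) :: _ ->
    failwith (Printf.sprintf "syntax error at line %d, column %d" p.line p.col)
```

The patterns only mention the first component of each pair, so the grammar stays readable while positions still reach the tree.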

But with menhir (and ocamlyacc) it seems like you need to embed the lexical position info in the token itself, e.g.
…
type basic_token =
| Tstring of lexical_position_info_t * string
| Tsemi of lexical_position_info_t
etc
…
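
A minimal sketch of what this embedding costs in practice, assuming a hypothetical `pos` record in place of the elided position type: every constructor carries its own position, and any code that wants it needs a helper that enumerates all constructors.

```ocaml
(* Hypothetical position type standing in for lexical_position_info_t. *)
type pos = { line : int; col : int }

(* Every constructor carries its own position. *)
type token =
  | Tstring of pos * string
  | Tsemi of pos

(* Getting the position back requires one case per constructor. *)
let pos_of_token = function
  | Tstring (p, _) -> p
  | Tsemi p -> p
```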

Is there some trick I’m missing for using ocamlyacc/menhir in a manner that preserves this positional information during the parse?

jhw · April 19, 2019, 11:15pm

In my Orsetto project, which contains a functional parser combinator library among many other things, there is an abstraction for representing a parsed object decorated with its source text location span. It’s easy enough to work with explicitly that I haven’t felt much of an itch to write a PPX that sugars the syntax for it the way the camlp4 stream parsers do.

Chet_Murthy · April 20, 2019, 12:21am

Yep, I get you: it’s really easy with stream-parsers, too, and I’ve done it a number of times. Thing is, the language for which I need this parser …
(1) had a YACC-parser for its version 1.5
(2) no longer has a YACC-parser starting at version 1.6 (today 1.12.x)
(3) has a (shall we say) “interesting” syntax. Some would call it “really approachable”; others would say “a hodgepodge of special cases that make people who know anything about parsing, cringe”
(4) I don’t want to actually understand the grammar (which would be necessary to convert it from LALR(1) to LL(1)), b/c again, “geez did you guys take meth when you were designing the syntax?”
(5) so I’m stuck using this YACC-grammar
(6) which, as it turns out, has all sorts of special-cases where it has to perform unholy acts with the lexer in order to parse the language (another reason converting to LL(1) would suuuuuuuck)

What’s this language? [I’d think that list of clues would suffice *grin*] Why, Golang.

Whatevs. Anyway, yeah, I wish I had an LL(1) grammar for the thing (say, in ANTLR, for instance). But I only have this YACC-grammar.

gasche · April 20, 2019, 8:13am

To have location/position information in the AST: the standard approach I’m familiar with is not to embed position information in the tokens, but to query it from the lexer or parser at the place where you build your AST values in the parser actions. When using ocamlyacc, I use the Lexing module for this (Lexing.lexeme_start_p, Lexing.lexeme_end_p); when using Menhir, I use its special symbols $startpos, $endpos, $startpos(n), $endpos(n), $loc, $loc(n).
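
As a rough sketch of this approach, the AST can carry a pair of `Lexing.position` values, which is the shape Menhir’s `$loc` produces (a start/end pair). The `located` type and the `mk` and `describe` helpers below are hypothetical names, not part of Menhir or the standard library.

```ocaml
(* A located AST node; the type and helper names are hypothetical. *)
type 'a located = {
  value : 'a;
  loc : Lexing.position * Lexing.position; (* start and end positions *)
}

let mk (loc : Lexing.position * Lexing.position) (value : 'a) : 'a located =
  { value; loc }

(* In a Menhir grammar, an action could then look roughly like
     expr: s = STRING { mk $loc (Str s) }
   where STRING and Str are placeholders for your own token and
   constructor. With ocamlyacc the pair would be assembled by hand
   from position-query functions instead. *)

(* Render a location as line and column range for error messages. *)
let describe (n : _ located) : string =
  let s, e = n.loc in
  Printf.sprintf "line %d, characters %d-%d"
    s.Lexing.pos_lnum
    (s.Lexing.pos_cnum - s.Lexing.pos_bol)
    (e.Lexing.pos_cnum - e.Lexing.pos_bol)
```

The tokens themselves stay position-free; only the semantic actions ask for positions, at the moment a node is built.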

To preserve comments, an approach we use in the OCaml compiler (where comments that are docstrings are kept in the AST) is to have a global table of comments, that is filled by the Lexer, and accessed from parsing actions (there is a function that says basically “collect all the comments from the last time you were called to this position”).
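
A minimal sketch of such a comment table, under the assumption that comments are identified by their starting character offset. The names `pending`, `record_comment`, and `take_comments_before` are invented for illustration and are not the OCaml compiler’s actual API.

```ocaml
(* Global table of comments not yet attached to the AST.
   Stored newest first, as pairs of comment text and start offset. *)
let pending : (string * int) list ref = ref []

(* Called by the lexer each time it skips over a comment. *)
let record_comment ~text ~offset =
  pending := (text, offset) :: !pending

(* Called from a parser action: return, in source order, the comments
   that start before offset [upto], and remove them from the table so
   the next call only sees later comments. *)
let take_comments_before (upto : int) : string list =
  let before, after =
    List.partition (fun (_, off) -> off < upto) !pending
  in
  pending := after;
  List.rev_map fst before
```

A parser action would call `take_comments_before` with the start offset of the node it is building, attaching whatever comes back to that node.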

Chet_Murthy · April 20, 2019, 5:13pm

Thank you for this pointer! It’s been so long since I started using ocamlyacc that I’d forgotten (or never learned) that it had these special variables, and I use menhir basically by copying my old ocamlyacc files and making the minimal necessary changes (so I never fully read the Menhir manual). Sorry, I’ll do that now!

ryanslade · April 20, 2019, 7:21pm

I know this doesn’t answer your question, but could you not use the parser that comes with Go to output the AST in a format that is easier to parse using Menhir?

See: https://golang.org/pkg/go/parser/

Chet_Murthy · April 20, 2019, 7:51pm

It’s an excellent question, and I considered it. Two problems:

(1) this idea ties me to the Golang infrastructure and feh/feh/feh I hate that fscking language (haha)

(2) Eventually I want to start making modifications to the grammar, in order to (um) support new language features. So I want to be starting with my own Golang parser.

A third one: (3) the hacky-ness of the grammar is somewhat paralleled by having to be hacky during pretty-printing. For example, there is no syntactic difference between a function-application and a type-cast. So “F(e)” could be either. And “*f(e)” could either be “apply f to e, then dereference the pointer” or a mis-parenthesized type-cast (of the type “*f” (pointer to type “f”)), which should have been written “(*f)(e)”.
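
To make the pretty-printing point concrete, here is a small sketch with a hypothetical AST in which one node covers both calls and casts. The printer has to add parentheses when the head is a pointer form, otherwise the output would re-parse as a dereference of the call result.

```ocaml
(* Hypothetical mini-AST: one node for both calls and conversions,
   since the parser cannot tell them apart. *)
type expr =
  | Ident of string
  | Star of expr              (* pointer dereference or pointer type *)
  | CallOrConv of expr * expr (* head applied to one argument *)

let rec show = function
  | Ident x -> x
  | Star e -> "*" ^ show e
  | CallOrConv (head, arg) ->
    (* Parenthesize a starred head so it stays the thing being applied. *)
    let h =
      match head with
      | Star _ -> "(" ^ show head ^ ")"
      | _ -> show head
    in
    h ^ "(" ^ show arg ^ ")"
```

With this printer, a star wrapped around a call and a call whose head is starred print differently, so round-tripping keeps the two trees distinct.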

And being able to pretty-print back out code, so I can round-trip at various points, is a useful way of building tests. My Golang code-corpus is the source distribution, so it’s a lot of code, and that’s been really useful (e.g. to find undocumented nooks-and-crannies of the “intended specification” that never made it into the “online specification”). Running away from the grammar would be counterproductive to that.

Yeah: I know, “what a steaming pile; who could come up with such a crazy syntax?” So it goes.
