Recursion in Menhir lexer for small DSL

Chet_Murthy · October 2, 2022, 9:42pm

Um, that regexp will match

/** foo bar */
....
...
/** goo boo */
...

by going from “foo” all the way thru to “boo”, no? You want a slightly more-complex regexp in the middle (instead of _*)

olleharstedt · October 2, 2022, 9:43pm

Hey Chet, yeah, those are all valid docblocks too, but they should return an empty list. Other valid examples:

/**
 * Mo mo mo, some info
 * @return void
 */

or

/** bla
 * @param array<string, int> $ar
more bla bla */

Chet_Murthy · October 2, 2022, 9:44pm

No, I mean that you’ll end up grabbing both docblocks, and all the code in between, won’t you? Lex will look for the longest match, right?

olleharstedt · October 2, 2022, 9:46pm

Ah crap, you’re right. Thanks, will fix.

Chet_Murthy · October 2, 2022, 9:47pm

Instead of _*, maybe you want something like:

( [^ '*'] | '*'+ [^ '/' '*'] )*

[I’m doing this on-the-fly, so I could be making a mistake here]
The idea is, you want the complement of the language of “*/”.

At least, I think that’s how it works – been so long I don’t quite remember anymore.

olleharstedt · October 2, 2022, 9:48pm

I’ll check the docs to see if I can match the shortest possible string instead.

Chet_Murthy · October 2, 2022, 9:48pm

Oh wait, and then at the end, you have '*'* – right before “*/”.
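
Putting the two pieces above together, a complete ocamllex file might look roughly like the sketch below. It is untested against the DSL in this thread: the token type is a hypothetical stand-in for whatever the Menhir grammar declares, and the catch-all rule that skips other input is only there to make the file self-contained.

```ocaml
{
  (* Hypothetical token type; in a real project it would come from the
     Menhir-generated parser module. *)
  type token = DOCBLOCK_AS_STR of string | EOF
}

rule token = parse
  (* The body is any run of non-'*' characters, or stars followed by
     something that is neither '/' nor '*'. It can therefore never
     contain "*/", so the match has to stop at the first terminator. *)
  | "/**" (([^ '*'] | '*'+ [^ '/' '*'])* as body) '*'* "*/"
      { DOCBLOCK_AS_STR body }
  | eof { EOF }
  | _   { token lexbuf }   (* assumption: skip everything else *)
```

One caveat with the single-regexp approach: newlines inside the docblock are consumed without calling Lexing.new_line, so line numbers in later positions will be off unless they are counted separately.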

olleharstedt · October 2, 2022, 10:20pm

It’s only the character combination */ that stops the comment. I made a buffer now instead, with a separate rule. Didn’t find anything about non-greedy matching in ocamllex.

Chet_Murthy · October 2, 2022, 10:23pm

Yes, but the string “*/” also matches the regexp “_*”. So the input

/** abc */ x y z /** def */

will yield a single docblock, containing the entire line. Or at least, IIRC, that’s how lex will work.
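
To make the difference concrete, here is a small plain-OCaml sketch (no ocamllex involved) that mimics the two behaviours on that input. The helper names are made up for illustration: greedy_docblock models what longest-match does with "/**" _* "*/", and shortest_docblock models what was intended.

```ocaml
(* First index >= from at which sub occurs in s, if any. *)
let find_from (s : string) (from : int) (sub : string) : int option =
  let n = String.length s and m = String.length sub in
  let rec go i =
    if i + m > n then None
    else if String.sub s i m = sub then Some i
    else go (i + 1)
  in
  go from

(* Last index at which sub occurs in s, if any. *)
let find_last (s : string) (sub : string) : int option =
  let n = String.length s and m = String.length sub in
  let rec go i =
    if i < 0 then None
    else if String.sub s i m = sub then Some i
    else go (i - 1)
  in
  go (n - m)

(* Longest match: from the first opener through the LAST "*/". *)
let greedy_docblock (s : string) : string option =
  match find_from s 0 "/**" with
  | None -> None
  | Some start ->
    (match find_last s "*/" with
     | Some stop when stop >= start + 3 ->
       Some (String.sub s start (stop + 2 - start))
     | _ -> None)

(* Intended behaviour: from the first opener through the FIRST "*/". *)
let shortest_docblock (s : string) : string option =
  match find_from s 0 "/**" with
  | None -> None
  | Some start ->
    (match find_from s (start + 3) "*/" with
     | Some stop -> Some (String.sub s start (stop + 2 - start))
     | None -> None)
```

On the input above, greedy_docblock returns the whole line, while shortest_docblock returns only “/** abc */”.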

olleharstedt · October 4, 2022, 9:44pm

For closure, this is what I ended up with:

and docblock_comment buffer = parse
  | "*/"                          { DOCBLOCK_AS_STR (Buffer.contents buffer) }
  | '\n'                          { new_line lexbuf; docblock_comment buffer lexbuf }
  | whitespace_char_no_newline+   { docblock_comment buffer lexbuf }
  | _? as s                       { Buffer.add_string buffer s; docblock_comment buffer lexbuf }
  | eof                           { failwith "unterminated docblock" }

Chet_Murthy · October 4, 2022, 9:58pm

Ah, that should work b/c (IIRC) longest-match wins, and at “*/” that’s the two-character “*/” rule, not the one-character _? rule.

It is simpler to write what you wrote than to calculate out the regexp, even if (to me) the regexp is … more satisfying.

olleharstedt · October 5, 2022, 4:08pm

I didn’t find any way to make a non-greedy regexp, so. No choice.
