Handling blank lines in a list of values

Hi!

I am playing around with Menhir and ocamllex. The format I’m trying to parse is the following:

1:a:b
2:c:d

A list of colon-separated triples, where the first item is an integer, and the second and third are strings.

I’ve managed to parse this format; you can find the code here. However, I’d like my parser to also work when there are blank lines between data rows:

1:a:b
2:c:d


3:e:f

4:g:h

I haven’t managed to get that to work properly. The best I’ve got is having my lexer read multiple newlines as a single NEWLINE token, but that breaks my next_line function, which keeps the Lexing.position record up to date.

My analysis of the problem is that a newline can be either significant (as the end of a triple) or non-significant (as the end of an empty line). Is there a clean way to handle this?

Uh, turns out all I needed was some rubber duck debugging. I came up with the idea that a list of values could be

  • A list of values, plus a newline and another value:
    | t = values ; NEWLINE ; h = value { h :: t }
  • A list of values, a newline, and no value:
    | t = values ; NEWLINE { t }

And it behaves as I expected. I’m still very open to remarks, suggestions and other bits of information. My ultimate goal is to write a bunch of parsers/serializers for RDF formats, the first of which is going to be N-Triples.
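One thing to watch with the left-recursive rule: `h :: t` conses each new value onto the front of what has been parsed so far, so the list is accumulated in reverse input order. The plain-OCaml sketch below mimics the two semantic actions with a fold. The per-line `option` representation and the name `collect` are assumptions made for illustration; they are not part of the grammar from the post.

```ocaml
(* Hypothetical stand-in for the parser's input: one element per line,
   [Some triple] for a data row and [None] for a blank line. *)
type triple = int * string * string

(* Mimics the two semantic actions: [h :: t] for the rule with a value,
   and [t] for the rule that only consumes a NEWLINE. *)
let collect (lines : triple option list) : triple list =
  List.fold_left
    (fun t line ->
      match line with
      | Some h -> h :: t (* newest row goes to the front *)
      | None -> t (* blank line: list unchanged *))
    [] lines

let rows = [ Some (1, "a", "b"); None; Some (2, "c", "d") ]

(* Rows come out last-first; one List.rev restores input order. *)
let reversed = collect rows
let in_order = List.rev reversed
```

If input order matters, applying `List.rev` once in the action of the start rule is a common way to deal with this.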

A potentially simpler way is to have your value rule end in a non-empty list of NEWLINE tokens and make your values non-terminal a plain list of value elements. In Menhir pseudo-code:

value: INT COLON STRING COLON STRING nonempty_list(NEWLINE)
values: list(value)
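To see which inputs this grammar accepts, here is a hand-written recognizer over a token list that mirrors the two rules. It is only a sketch: the `token` type and the function names are assumptions made for illustration, and the real parser would be generated by Menhir.

```ocaml
(* Hypothetical token type matching the terminals of the pseudo-code. *)
type token = INT of int | STRING of string | COLON | NEWLINE

(* Consumes the remainder of nonempty_list(NEWLINE) after the first one. *)
let rec skip_newlines = function
  | NEWLINE :: rest -> skip_newlines rest
  | rest -> rest

(* values: list(value), where each value ends in at least one NEWLINE. *)
let rec values = function
  | [] -> Some []
  | INT i :: COLON :: STRING a :: COLON :: STRING b :: NEWLINE :: rest ->
    Option.map (fun vs -> (i, a, b) :: vs) (values (skip_newlines rest))
  | _ -> None (* anything else is a syntax error *)

let row i a b = [ INT i; COLON; STRING a; COLON; STRING b ]

(* Two rows separated by a blank line, with a final newline: accepted. *)
let ok =
  values (row 1 "a" "b" @ [ NEWLINE; NEWLINE ] @ row 2 "c" "d" @ [ NEWLINE ])

(* No final NEWLINE, or a blank line before the first row: rejected. *)
let no_trailing = values (row 1 "a" "b")
let leading_blank = values ((NEWLINE :: row 1 "a" "b") @ [ NEWLINE ])
```

Read this way, the grammar would reject a file whose last row has no trailing newline, or one that starts with a blank line. Whether that matters depends on the inputs you expect.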

By the way, instead of writing your own next_line function, you can use Lexing.new_line.

Cheers,
Nicolas

Thanks for the input!

I’ve been using the Real World OCaml (RWO) chapter on ocamllex and Menhir. It’s great, but I think it lacks a few pointers to additional resources 😅

I seem to recall having this problem too, and having to adjust the line count manually in the lexing action. It wasn’t a big deal though.
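
A minimal sketch of that manual adjustment, assuming newlines are plain `\n` characters and the lexer turns a whole run of them into one NEWLINE token: call `Lexing.new_line` once per newline in the matched text. The helper name `advance_lines` is hypothetical.

```ocaml
(* Hypothetical helper: bump the line counter of [lexbuf] once for every
   newline character found in [lexeme], the text matched by the rule. *)
let advance_lines (lexbuf : Lexing.lexbuf) (lexeme : string) : unit =
  String.iter (fun c -> if c = '\n' then Lexing.new_line lexbuf) lexeme

(* Intended use inside an ocamllex rule (not compiled here):
     | '\n'+ as s { advance_lines lexbuf s; NEWLINE }            *)

(* Stand-alone demonstration on a fresh lexbuf, which starts on line 1. *)
let lexbuf = Lexing.from_string "1:a:b\n\n\n2:c:d\n"
let () = advance_lines lexbuf "\n\n\n"
let line_after = lexbuf.Lexing.lex_curr_p.Lexing.pos_lnum
```

`Lexing.new_line` also resets the beginning-of-line offset to the current position. Inside a lexer action that position is just past the matched run of newlines, so column information for the next token should stay consistent.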