Ocamllex and menhir examples that aren't calculators

Preface:

I’m currently trying to parse a very simple document format using the above tools (and probably more than one document at a time). However, all the examples I seem to find are for a calculator directly evaluating expressions. I feel like all my posts on this forum involve the words “maybe I’m a complete idiot”, but…

Example:

So let’s simplify my problem even more: let’s say I want to parse a list of double-quoted strings separated by ','.

How would I define this grammar, do I need ocamllex? Can I do this entirely inline (as the docs vaguely hint at, and a few nearly decade old posts)? Can somebody give me a simple example?

More Questions:

  • What is the recommended way to use ocamllex and menhir (ocamlyacc) together (not as a calculator lol)?
    • With dune especially!
  • What am I missing?
  • And where can I get more guidance or examples to read through.

Thanks ya’ll!

I believe that you only need ocamllex for such a simple grammar, but you can use menhir as well.
Your menhir file would look like this:

%{ 
(* ocaml code here *)
%}
%token <string> STRING
%token COMMA 
%token EOF
%start <string list> main
%%
main:
    separated_list(COMMA, STRING) EOF { $1 }

And the lexer would be something like this:

{
open Parser
open LexerHelper
}

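let digit = ['0'-'9']
let hexdigit = ['0'-'9' 'a'-'f' 'A'-'F']
let atom_code = ('\\' digit digit? digit?) | ("\\0x" hexdigit hexdigit?)
let string_printable = [' '-'!' '#'-'~']
let string_atom = string_printable | atom_code | "\\t" | "\\r" | "\\b" | "\\n" | "\\\"" | "\\\\"
(* '+' rather than '*': a pattern that can match the empty string would make
   [token] call itself forever on an unexpected character *)
let whitespace = [' ' '\n' '\t' '\r']+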

rule string accumulator = parse
 | "\"" { STRING(
            String.of_seq
              (List.to_seq
                 (List.map (char_of_atom lexbuf)
                 (List.rev accumulator)))) }
 | string_atom { string ((Lexing.lexeme lexbuf) :: accumulator) lexbuf               }
 | eof         { error "during lexing" (Position.cpos lexbuf) "Unterminated string." }

and token = parse
  | whitespace { token lexbuf }
  | "," { COMMA }
  | "\"" { string [] lexbuf } 
  | eof { EOF } 

And a lexerHelper.ml file with the following code (error and Position.cpos come from my own project and are not shown here):

let char_of_atom lexbuf atom =
  match atom with
  | {|\n|} -> '\n'
  | {|\t|} -> '\t'
  | {|\b|} -> '\b'
  | {|\r|} -> '\r'
  | {|\\|} -> '\\'
  | {|\'|} -> '\''
  | {|\"|} -> '"'
  | _ when String.length atom = 1 -> atom.[0]
  | _ when atom.[0] = '\\' -> (
      (* int_of_string raises Failure and Char.chr raises Invalid_argument *)
      try Char.chr (int_of_string (String.sub atom 1 (String.length atom - 1)))
      with Failure _ | Invalid_argument _ ->
        error "during lexing" (Position.cpos lexbuf) "Invalid escape code." )
  | _ -> failwith "Should never happen"
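
To sanity-check the escape decoding without generating a lexer, here is a minimal self-contained sketch of the same helper. The `Lexing_error` exception, the one-argument `error`, and the `decode` function are stand-ins invented for this example (the original takes a lexbuf and reports a position), so treat it as an illustration rather than the project's actual code.

```ocaml
(* Stand-in for the project's own error reporting (an assumption). *)
exception Lexing_error of string

let error msg = raise (Lexing_error msg)

(* Same cases as the helper above, minus the lexbuf argument. *)
let char_of_atom atom =
  match atom with
  | {|\n|} -> '\n'
  | {|\t|} -> '\t'
  | {|\b|} -> '\b'
  | {|\r|} -> '\r'
  | {|\\|} -> '\\'
  | {|\"|} -> '"'
  | _ when String.length atom = 1 -> atom.[0]
  | _ when String.length atom > 1 && atom.[0] = '\\' -> (
      let code = String.sub atom 1 (String.length atom - 1) in
      (* int_of_string raises Failure; Char.chr raises Invalid_argument. *)
      try Char.chr (int_of_string code)
      with Failure _ | Invalid_argument _ -> error ("bad escape: " ^ atom))
  | _ -> error ("unexpected atom: " ^ atom)

(* Hypothetical helper: what the string rule does once the closing quote
   is reached, given the atoms in source order. *)
let decode atoms = String.of_seq (List.to_seq (List.map char_of_atom atoms))
```

Numeric escapes go through `int_of_string`, so the decimal form is read as decimal and the `\0x..` form as hexadecimal.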

Note that I did not test any of this, I just pieced it together from old code of mine. For dune rules and general organisation, you can look at this small DSL of mine: chat_botte/src/rolelang at master · EmileTrotignon/chat_botte · GitHub.
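
Since the question asked about dune specifically, here is a rough sketch of the build files for the example above. It assumes the files are named `lexer.mll`, `parser.mly` and `lexerHelper.ml`, sit in the same directory, and are driven by a hypothetical `main.ml`; the version numbers are just ones that work together, not a requirement.

```dune
; dune-project
(lang dune 3.0)
(using menhir 2.1)

; dune (same directory as lexer.mll and parser.mly)
(ocamllex lexer)

(menhir
 (modules parser))

(executable
 (name main))
```

With that in place, `main.ml` could call something like `Parser.main Lexer.token (Lexing.from_string "\"a\", \"b\"")`, which should give back a `string list`, since the entry point generated by menhir takes the lexer function and a lexbuf.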

@EmileTrotignon

Gonna read through this in more detail later, as I’m on PST time right now where I am and it’s getting hyper late.

Thanks for the example!

I have an example of using Menhir+OCamllex for a custom dsl here: parser · master · Kiran Gopinathan / GSDL - Gop-Scene Definition Language · GitLab

Relevant files are probably lexer.mll, parser.mly and dune.

More recently, though, I’ve started using sedlex instead of ocamllex for my lexers, because sedlex lets me write them in normal OCaml syntax (and hence use merlin and gopcaml-mode for editing).

For other examples of Menhir parsers, the OCaml parser is also written in menhir and might be worth checking out as well.

Sedlex is a little complicated to plug into menhir because it does not use the same lexbuf type as ocamllex. It’s a lot nicer to use though.

Yeah, good point - it does require a bit more boilerplate, although not too much nowadays.

This was all the extra code I needed to glue between the two interfaces for a recent project:

exception Error

let revised_parse lexbuf =
  let tok () =
    let tok = Lexer.token lexbuf in
    let (st,ed) = Sedlexing.lexing_positions lexbuf in
    (tok,st,ed) in
  MenhirLib.Convert.Simplified.traditional2revised
    Raw_parser.program
    tok

let parse lexbuf =
  try
    revised_parse lexbuf
  with Raw_parser.Error -> raise Error

let parse_string str =
  parse (Sedlexing.Utf8.from_string str)

edit: fixed locations

This forces you to use the table backend though? That backend is quite a bit slower (which does not make a difference for most uses). I believe there should be a way to provide an “abstract” buffer to menhir, that is, a function of type unit -> token and two of type unit -> position. But it’s not available yet.
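
For anyone unfamiliar with the shape being discussed: the glue code above hands the parser a single function of type `unit -> token * position * position`. Below is a minimal self-contained sketch of that interface. The `token` type, `supplier_of_list` and the hand-written `parse_strings` are stand-ins made up for illustration; in a real project the token type and the parser come from menhir and the supplier wraps the sedlex lexer.

```ocaml
(* Stand-in token type; normally generated by menhir from the %token lines. *)
type token = STRING of string | COMMA | EOF

(* The "revised" lexer shape: each call yields a token with its start and
   end positions. *)
type supplier = unit -> token * Lexing.position * Lexing.position

(* Hypothetical helper: turn a fixed token list into a supplier, using
   dummy positions. Once the list is exhausted it keeps returning EOF. *)
let supplier_of_list (tokens : token list) : supplier =
  let remaining = ref tokens in
  fun () ->
    let tok =
      match !remaining with
      | [] -> EOF
      | t :: rest ->
          remaining := rest;
          t
    in
    (tok, Lexing.dummy_pos, Lexing.dummy_pos)

(* Hand-written stand-in for a generated parser. It accepts either EOF
   alone or STRING (COMMA STRING)* EOF, and ignores the positions. *)
let parse_strings (next : supplier) : string list =
  let tok () =
    let t, _, _ = next () in
    t
  in
  let rec after_string acc =
    match tok () with
    | EOF -> List.rev acc
    | COMMA -> (
        match tok () with
        | STRING s -> after_string (s :: acc)
        | _ -> failwith "expected a string after ','")
    | STRING _ -> failwith "expected ',' or end of input"
  in
  match tok () with
  | EOF -> []
  | STRING s -> after_string [ s ]
  | COMMA -> failwith "expected a string"
```

For example, `parse_strings (supplier_of_list [STRING "a"; COMMA; STRING "b"; EOF])` walks the supplier one token at a time, which is the same pull-based protocol the converted menhir parser uses.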

I’ve found this to be a well written set of articles and reasonably complete example language.

There is the PL Zoo, where you can find many such examples.

Real World OCaml has an example of JSON parsing.

nice-parser has an example of how to parse S-expressions. It’s a tiny library that encapsulates all the typical boilerplate code, so you can just focus on lexing and parsing.
