Define Literals on Parser using ocamlyacc / menhir

Hey!

I’m using OCaml to build a compiler and I’ve never used OCaml before. I’ve used YACC and Lex before though. I’m using ocamlyacc right now, but I can change to menhir if it facilitates the question.

My language has a lot of text. For example, I could write “My name is Alice.” and I would like to simply extract “Alice” from there. As far as I understand, the lexer has to output a token for all the words (My, name, is) or for the entire sentence (“My name is”). However, the language is quite big and defining all this partially sentences or words as tokens would easily reach the maximum size of the automata.

The lexer code:

{
  open Parser
  exception Eof
}

rule token = parse
    [' ' '\t']                  { token lexbuf }     (* skip blanks *)
  | ['\n' ]                     { EOL }
  | ['a'-'z''A'-'Z']+  as lxm   { WORD(lxm) }
  | "My name is"                { MY_NAME_IS }
  | eof                         { raise Eof }

The parser code:

%token <int> INT
%token <string> WORD
%token EOL
%token MY_NAME_IS

%start main
%type <Types.person> main

%%

main:
  MY_NAME_IS WORD EOL { { name = $2 } }
;

Is there any other way of defining these structures on the parser itself? For example, write "My name is" WORD EOL instead. Otherwise, producing tokens for all these semi-sentences (or even word by word) becomes quickly tedious and probably going to reach some limit.

Thank you in advance!

I think more detail into your problem would definitely make it easier to help you.
However, I believe the answer is no, Menhir and Yacc both take as input a sequence of tokens with semantic values attached. I am not even sure what kind of representation you would like.

Oh and, yes Menhir is a lot better than OCamlyacc, look up the “library” section of its doc. Its also a lot faster, and uses less memory.

If you want more flexible parsers, you might want to look into combinator parsers. You will lose some performance though.

1 Like

To be more specific on my “more detail” request :

How do you now you need to extract Alice from there ? Is this because it follows a “My name is” keyword ? If so, you would probably want the lexer to detect that “My name is” is a keyword, and that would produce its own token, and then “Alice” would just be a general “Text” token.

1 Like

Hello Emile! Thanks a lot. So, I’m building a tool to parse contracts that are written in a certain format.

Contract Example

The parties:

Will Smith, undermentioned as WS; and
Try Bank, undermentioned as TB.

Hereby enter in a Bond Purchase Agreement defined as follows:

The TB agrees on issuing and selling a bond of EUR 10,000.00 to WS for EUR 9,800.00. The aforementioned bond reaches maturity on the 1st of October 2025.

Signed by WS and TB on the 24th of September 2021.

What I need

Basically I need to capture the bold parts by defining a grammar. And there’s many other types of contracts and thus, grammar rules. I found a parser in Go called participle where this would be quite easy to do. However, one of the requirements is that I use OCaml for this project.

So yes, there is a fixed structure and I could use keywords. However, I’m afraid that there’s way too many tokens and it won’t escalate. In participle, I can match literals with any token. So I can have a token called WORD that matches words and if I write 'My' 'name' 'is' @WORD on the parser, it would use the value from @WORD for the right field.

Is there anything similar or would I need to define keywords for all the structures? I feel they would become… too may rather quickly.

Thanks a lot :slight_smile:

I do not think you have any kind of nesting in your grammar, so maybe you should just use the lexer.
The lexer can have rules of the type "after having matched that, go into this mode (or state) and … "
It may be sufficient for your needs.

ocamllex has the -ml options where it does not use a table, but compiles rules to OCaml functions.

This should allow to add as many keywords as you like.

As a side node the information about this option is contradictory. man ocamllex says that it should only be used for debugging, and the OCaml manual says:

-ml
Output code that does not use OCaml’s built-in automata interpreter. Instead, the automaton is encoded by OCaml functions. This option improves performance when using the native compiler, but decreases it when using the bytecode compiler.

What you seem to be looking for is called scannerless parsing and is often combined with GLR grammars and Earley parsing. In OCaml, GitHub - rlepigre/ocaml-earley: Parsing library based on Earley Algorithm looks close to what you’re after. There’s also Dypgen.

What you are really looking for is natural language processing (NLP) resources in OCaml. I don’t have pointers to offer.

4 Likes

Thank you for your answers. I will investigate if only the use of the lexer itself is enough (even though I’d prefer to have a way to ignore the whitespaces all the time), or if the I manage to make it work with ocaml-earley. :slight_smile: