Advice on how to componentize and later combine source code?

I’m writing this note to ask for suggestions from readers. First, some setup: we all know that LR grammars are … famously fickle when it comes to adding rules post-hoc: you really do need to understand the entire grammar in order to do it safely. It’s widely accepted (and I certainly believe it) that LL(1) grammars are much more amenable to post-hoc extension – the acceptability of a new production can typically be decided and explained based only on the other productions of that nonterminal, and not the entire grammar (OK, it’s not completely true, but pretty much). And so, in Camlp4/5, there’s an LL(1) grammar-interpreter that can be dynamically extended at runtime. The grammar-rules are compiled separately, where they are defined, and are combined by the interpreter when modules containing rules are dynamically (or statically) linked.

So … I’m writing an LL(k) parser-generator, based on reading the ANTLR papers, and it will be intrinsically compiled (no interpreter). Partly that’s because ANTLR supports parsing rules that take arguments (so you can express “accumulating parameters” in the grammar directly), and partly because to do otherwise would require a ton of Obj.magic, which I’d like to avoid. This means that things won’t be dynamically extensible at runtime, but I would like to preserve the ability to write a grammar in parts, combining them only when all the modules with grammar-rules have been installed.

So here’s my question: does anybody have any suggestions for how one might put bits of grammar-rules in many different modules, and then combine them later to allow for grammar-compilation? I’m literally asking for suggestions on how packaging might be effected.

Sorry if this is far too open-ended: I’ve been focusing on getting the thing working, and not so much on how I might preserve extensibility.

For ocamllex there is an easy answer: each rule returning a token of type tok becomes an OCaml function Lexing.lexbuf -> tok, and you can perfectly well combine lexers by calling a rule from another module (or, within a functor, from a module parameter, etc.). This works because the OCaml type gives all the information needed to combine/compose those rules – or, conversely, because the API of lexer rules is simple enough to be captured by a simple OCaml type without loss of modularity.
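A minimal sketch of that point, with hypothetical names: an ocamllex entry point compiles to an ordinary OCaml function, so composing lexers is just composing functions across module boundaries.

(* Hypothetical token type; Base_lexer.token : Lexing.lexbuf -> tok is
   assumed to have been generated by ocamllex in some other module. *)
type tok = INT of int | IDENT of string | EOF

(* A wrapper lexer can delegate to any module with the right signature: *)
module Extend (Base : sig val token : Lexing.lexbuf -> tok end) = struct
  let token (lb : Lexing.lexbuf) : tok =
    match Base.token lb with
    | IDENT "begin" -> IDENT "BEGIN"   (* post-process one case ... *)
    | t -> t                           (* ... and pass the rest through *)
end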

Can one do this for LL(k) rules? Is there a “type of the rule” that is enough to correctly “run” the rule without knowing its definition?

Your observation for ocamllex corresponds to “don’t allow nonterminals to reference each other, except in very, very, very controlled settings”. But with LL(1), it is very straightforward (and camlp4/5 do it, as does Coq, unless they’ve changed things) to have a nonterminal (let us say, expr for expressions) that has a bunch of infix operators at some level, and to introduce new infix operators at that precedence level. Or new syntax for modules, etc, etc.
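To make that concrete, here is a rough, schematic sketch in Camlp5’s EXTEND syntax (it needs the camlp5 preprocessor to compile; the “+!” operator, the Saturating module, and the choice of level are all made up for illustration):

open Pcaml

EXTEND
  expr: LEVEL "+"
    [ [ e1 = SELF; "+!"; e2 = SELF ->
          <:expr< Saturating.add $e1$ $e2$ >> ] ]
  ;
END

The point is that this fragment grafts one new production onto one level of an existing nonterminal, without its author needing to see the rest of the grammar.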

I fully recognize that this is part of what gave Camlp4 a bad name, back when that was the primary way to extend OCaml syntax, and except for very controlled uses, I’m 100% with the idea that PPX extensions are a better way. But, for instance, the “stream parsers” are pretty difficult to write without extending the syntax – finding a way to “reuse” (I’d say “pervert”) existing OCaml syntax and then reinterpret it as stream-parsing is pretty hard.

Back to your last question: first, the unit of composability would not be “a collection of nonterminals and all their productions”, but rather a single production added to an existing nonterminal. So you can see that it’s going to be necessary to analyze the production in the context of the other productions of that nonterminal (at a minimum). But that’s already done in Camlp4/5. The real problem is that in ANTLR, you can have productions that take arguments, viz.

e: [ [ l = elist[ [] ] -> l ] ] ;
elist[ acc ]:
  [ [  n = INT ; ","; rv = elist [ (n :: acc) ] -> rv
    |  -> List.rev acc
    ] ] ;

[I hope I got that right] Notice that elist is of type int list -> int list, and the natural recursive-descent parser we would want to generate does its work by accumulating parameters. ANTLR supports this, and I think it’s pretty nice and useful. I wouldn’t want to eschew this. So if we’re going to do this without compiling code, we’re going to have a ton of Obj.magic and explicit manipulation of stacks and such. I didn’t grow up a Typed Programming Languages guy to get into that. And I’ve seen that that sort of stuff makes Camlp5’s grammar machinery pretty opaque, too, even though it doesn’t have parameters and such.
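To make “the natural recursive-descent parser” concrete, here is roughly the code one would want a generator to emit for the grammar above – the token and stream representations are invented here, and the trailing-comma quirk mentioned in the ETA below is preserved faithfully:

type token = INT of int | COMMA | EOF

(* elist takes the accumulating parameter directly -- no closures: *)
let rec elist (acc : int list) (toks : token list) : int list * token list =
  match toks with
  | INT n :: COMMA :: rest -> elist (n :: acc) rest  (* first production *)
  | rest -> (List.rev acc, rest)                     (* empty production *)

let e (toks : token list) : int list * token list = elist [] toks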

What I want to find is a way to put bits of grammar rules (productions) into source-code files in individual OCaml modules, and then combine them together, do the global analysis[1], and emit a compiled parser.

[1] BTW, that analysis really is global. It’s just that, unlike with LR grammars, when there’s an error in some production, it’s typically easy to explain that error without recourse to far-away bits of the grammar – usually you can point at that production and the other productions of the same nonterminal to explain what went wrong. Also, I should note that menhir has made massive strides in this area, so at some level, ANTLR-like technology is perhaps obviated by Menhir. But OTOH, hey, it’s lovely stuff, and we don’t always do stuff b/c it’s strictly better – sometimes we do it b/c it’s beautiful.

Hope this helps.

ETA: Oops, I made that grammar accept lists of integers with a trailing comma. I could fix it, but it’s irrelevant for the issue I’m presenting, so I’ll leave it as-is.

I am not sure I understand why arguments make a difference. Your example is strictly equivalent to the following one:

e: [ [ l = elist -> (let l = l [] in l) ] ] ;
elist:
  [ [  n = INT ; ","; rv = elist -> (fun acc -> let rv = rv (n :: acc) in rv)
    | -> (fun acc -> List.rev acc)
    ] ] ;

and it is trivial to get one from the other in a purely syntactic manner. So, whether the rules take arguments or not is irrelevant.
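In plain OCaml, and with the same invented token representation as the sketch above, the closure-converted form looks like this – elist now returns a function int list -> int list instead of taking the accumulator as an argument:

type token = INT of int | COMMA | EOF

let rec elist (toks : token list) : (int list -> int list) * token list =
  match toks with
  | INT n :: COMMA :: rest ->
      let rv, rest' = elist rest in
      ((fun acc -> rv (n :: acc)), rest')   (* defer consing the accumulator *)
  | rest -> ((fun acc -> List.rev acc), rest)

let e (toks : token list) : int list * token list =
  let f, rest = elist toks in
  (f [], rest)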

What matters is how you handle the return values of your rules. For instance, in your original example, you have INT which returns an integer and e which returns a list. If you already know how to handle those, then supporting rule arguments will be straightforward and will not require any extra use of Obj.magic.

Ah. I should have specified that I wanted to avoid closure-creation, since this needs to be efficient enough to compete with parsers that avoid it.

In any case, if you’re going to write an interpreter, you have to sprinkle it liberally with Obj.magic (as Camlp5 does). I’m choosing to write a compiler so that I can avoid that.
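A minimal sketch of why, assuming the interpreter keeps a heterogeneous semantic-value stack (hypothetical code, not Camlp5’s actual internals):

(* Semantic values of different rules have different types, so a generic
   interpreter stores them untyped and coerces unsafely at the edges. *)
let push (stack : Obj.t list) (v : 'a) : Obj.t list = Obj.repr v :: stack

let pop_as (stack : Obj.t list) : 'a * Obj.t list =
  match stack with
  | v :: rest -> ((Obj.obj v : 'a), rest)   (* unchecked cast *)
  | [] -> failwith "empty semantic-value stack"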

AFAIR, Parsley (https://github.com/j-mie6/ParsleyHaskell, a reimplementation of Parsley in Haskell, with improvements) uses Template Haskell to deal with parsers that are spread across many files. Maybe MetaOCaml could perform the same trick (for free :slight_smile:)?

Perhaps you could simply have your rules in separate files, and have a dune rule that depends on all those files, plus a single invocation of your compiler (%{dep:X} automatically adds X as a dependency, so you don’t need to list it twice; you could probably also use globbing to depend on all *.grammar files in a directory):

(rule
 (target final.ml)
 (action
  (with-stdout-to %{target}
   (run %{libexec:yourcompilerpackage:yourcompiler}
        %{dep:file1.grammar} %{dep:file2.grammar} ... %{dep:fileN.grammar}))))

If there is some pre-processing step that could be done in parallel, then you’d have to define some rules to produce a .grammar.preprocessed for each file and “link” those (doing the global analysis) at the end.
I don’t think dune provides good support for this currently; it would probably be best to teach dune about your tool directly if that is needed. However, I’d suggest experimenting with the simple sequential rule first.

If you want to ship grammars as part of multiple different libraries and link all those together at the end, that’d be more complicated, but I think you could use dune's %{lib:yourlib1:file1.grammar} and %{lib:yourlib2:file2.grammar} to refer to them, and install stanzas to install them.

OK, so let’s get concrete. How are we going to write these grammars? I’m currently doing it as a custom DSL, in a PPX extension, viz.

[%llk
{foo|
GRAMMAR G:
GLOBAL: etop ;

  e[x]: [ [ f = FLAG "foo" -> (f,x) ] ] ;
  etop: [ [ x = e[1] -> x ] ] ;

END;

|foo}
] ;;

Suppose I have a number of such PPX extensions, each in different modules. There’s the problem of how to combine them, and also of how to express that they need to be combined. Of course, the code is all source, b/c it can’t be compiled until it’s combined with all the other parts.

It’s possible the answer is simply that I’ll use cppo to #include grammar-fragments. But I figured I’d ask, in case some obvious solution that fits better with OCaml was waiting.

How about in a.ml you have:

let grammar1 = [%llk {foo| .... |foo}]

In b.ml you have:

let grammar2 = [%llk {foo| .... |foo}]

And in main.ml you have:

let grammar = [%llkgen A.grammar1 B.grammar2]

A.grammar1 and B.grammar2 could simply be stored as strings (or as a parsed AST), and %llkgen is the one which would actually do the work of combining these grammars.
And then the dune file will probably be simpler to write, but it will require users to use a ppx. As a tradeoff you could also provide your users with a CLI tool that compiles multiple grammar files.
(The downside of PPX is that they tend to be incompatible with new compiler versions and require upgrading lots of packages to move to new compiler versions, so having your tool both in a PPX and non-PPX form will likely be useful for some users.)

Aha! I thought about your comment, and then it hit me! Here’s what I should do (see what you think?):

  1. in each module, a top-level attribute will hold the grammar rules (in file f.ml):
module M = struct
[@@@llk {foo| GRAMMAR G:...... grammar rules END |foo}]
end
  2. later, in a module that composes these rules:
[@@@llk {foo| GRAMMAR G2: include F.M.G; ..... END |foo}]

The grammar-compiler will get the attribute (via CMTI files, IIRC, like ppx_import does today), and merge the rules with other rules fetched the same way. And then, for the code-snippets in each bit of grammar, I can wrap them with open F; open M so that references to free names are properly scoped. It won’t be perfect – there’s no way to restrict references to only those names declared before the point where the attribute appears in the source file – but it should be better than nothing.
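For what that fetching might look like: a rough sketch (hypothetical helper; the compiler-libs API shown is roughly the 4.14-era one and shifts between compiler versions) that scans a .cmt for top-level [@@@llk "..."] attributes and returns their string payloads:

let llk_fragments_of_cmt (file : string) : string list =
  let open Cmt_format in
  let infos = read_cmt file in
  match infos.cmt_annots with
  | Implementation str ->
      (* keep only [@@@llk "..."] items whose payload is a string literal *)
      List.filter_map
        (fun (item : Typedtree.structure_item) ->
          match item.str_desc with
          | Tstr_attribute
              { attr_name = { txt = "llk"; _ };
                attr_payload =
                  PStr
                    [ { pstr_desc =
                          Pstr_eval
                            ( { pexp_desc =
                                  Pexp_constant (Pconst_string (s, _, _));
                                _ },
                              _ );
                        _ } ];
                _ } ->
              Some s
          | _ -> None)
        str.str_items
  | _ -> []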

Re: PPX & non-PPX, of course I have both – heh, the non-PPX came first, until I started writing unit-tests. I maintain Camlp5 and also write a ton of PPX rewriters with it, so I have a lot of experience with both ways of thinking of things. And … I’m pretty much convinced that PPX syntax was a good thing. And that for these grammars, it’s the right thing. Why? Because if you put your grammar-rules in a separate file, you have to put your associated OCaml code in separate files, too, instead of putting them in the same file when desirable. So for instance, in llk-ocaml/simple_test.ml (https://github.com/chetmurthy/llk-ocaml/blob/master/simple_test.ml) you can see that grammars are mixed in with tests, and that’s just simpler.


Good point about ppx_import; it does something fairly similar to what you want to achieve here (except for types rather than grammars).