Is ppx @@deriving the right solution for "data.ml" -> "data.rs"?

File data.ml defines a number of types:

  • tuples
  • records
  • variants

For each of these, I want to automatically define a corresponding type in data.rs (note the extension: this is a Rust file).

My current plan is to use @@deriving.

In particular, I would hijack ppx_deriving_accessors.ml (from the ocaml-ppx/ppxlib repository on GitHub, at main) to insert some file IO.

However, the issue I immediately run into is that this function is called once per @@deriving, and I can’t really control when, how, or how often it is called, in particular in relation to incremental compilation.

What I really want to say instead is:

  1. somehow ‘tag’ certain types in the OCaml module
  2. at the end of a successful build, take all the ‘tagged’ types, and process them all at once

Advice?

====

We have a bunch of *.ml files. I want to tag some ‘types’ (tuple, record, variant). At the end of a successful compile, I want to run another ocaml program which takes as input all the ‘tagged’ types, then generates corresponding Rust tuples/structs/enums for each of the types.

Not sure if ppx or code generator approach is better for your specific use case, but if you want to try a codegen approach, you can use dune rules to automatically run the codegen program to generate/update the needed files.

I’m sure there are many interesting examples of this in the wild, but if you want some specific examples of using dune rules for code generation, the pyml_bindgen github repository has a few examples that use rules to generate pyml bindings.

And by the way, you can still use the attributes in your code even if they don’t hook into ppx, eg, for use in your codegen programs.

Thanks for the dune rules suggestion.

I don’t know if this is the right way to think about this, but I’m converging towards:

  1. are you generating OCaml code that needs to be inlined? => @@deriving

  2. are you generating non-OCaml code to be used later? => codegen, add a rule to dune

===

Assuming the above classification is right, the problem reduces to (may or may not be ppx): we need something that can parse the *.ml files, find the tagged types, then process them.

Did not realize (1) you are the author of ocaml python bindgen, and (2) bin + lib is only 1.6k LOC; I think I can make this work.

(Yes, I’m the pyml_bindgen author… Maybe I should have put the disclaimer, haha.) You’re right though, the code is not that long, and not really tricky or clever. In fact, it could probably be better/simpler/shorter than it is now, as it started as something much smaller and then accumulated more and more little things. But that is hindsight, of course.

Again, there are probably more robust/interesting/cool ways to do it, but if you just need to parse your ml files for attributes, then back up, grab those decorated types, and parse them, and (if I understand you correctly) given that you are in full control of the input to your codegen, it would probably be pretty straightforward to roll your own tiny parser with angstrom (or menhir or whatever you like) tuned exactly to the stuff you will be parsing. From there, use the type info and your generated AST to generate your Rust files.

Just one thing to mention, at the risk of going too far off topic: I think it could be fun/useful to use ppx for something like pyml_bindgen; I just haven’t gotten around to thinking about it much. (But the difference there is that it is generating OCaml code, whereas in your case you’re generating Rust code.) So, regarding your rules, sometimes OCaml codegen can “get the job done” as well.

This is great!

I’ve copied over bin/ and lib/ from your project. I’m now looking at examples/ and trying to find the one closest to my needs, i.e. input = *.ml, output = *.rs

However, if I’m reading the comments correctly, it seems many of these are focused on input=*.py, output = *.ml ? I.e. it’s solving the reverse problem of “python = source of truth; generate bindings for ml”; whereas I’m trying to find code of the form

“ocaml = source of truth; generate *.py files output”

Am I misreading the examples in examples/ ?

EDIT: I’m looking at the examples directory of mooreryan/ocaml_python_bindgen on GitHub (at main) and somehow not finding an example where

input = *.ml files
output = *.py

I feel like I’m missing something obvious.

So the main input is a subset of things allowed in OCaml value specifications, ie, these. You can see where these are passed in to the codegen program in the dune rule right here. Here is the file generated.

So it goes OCaml value specs => OCaml code with pyml bindings.

So in your case (again if I understand your need correctly), it would be ml files => rust files. And you can use dune rule to do that as well. (As an example, I have used dune rules to generate html and js files from jsoo into a public directory for serving.)

By the way, feel free to open any issues if the docs aren’t clear (as it sounds as if I could have done a better job explaining things).

Edit: this rule has lots of comments, if it may help you.

Edit2: you could imagine a rule going something like this for your use case:

(rule
 (target snazzy_rust.rs)
 (mode ...)
 (deps ...)
 (action 
  (with-stdout-to
   snazzy_rust.rs
   (run snazzy_gen ./*.ml)))
... more stuff ...
)
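For a sense of what sits on the other side of that rule, here is a minimal sketch of a generator entry point using only the standard library. The name snazzy_gen comes from the rule above; read_file, translate and generate are hypothetical helpers, and translate is only a placeholder that emits a comment rather than real Rust types. The point is just the plumbing: file paths come in as arguments, Rust text goes to stdout, and with-stdout-to captures it.

```ocaml
(* snazzy_gen.ml: hypothetical generator entry point (stdlib only). *)

(* Read a whole file into a string, closing the channel even on error. *)
let read_file (path : string) : string =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () -> really_input_string ic (in_channel_length ic))

(* Placeholder translation. A real generator would parse [source] and emit
   Rust structs/enums; here we only emit a comment so the plumbing is visible. *)
let translate ~(path : string) ~(source : string) : string =
  Printf.sprintf "// generated from %s (%d bytes)\n" path (String.length source)

(* Translate every input file and concatenate the results in order. *)
let generate (paths : string list) : string =
  paths
  |> List.map (fun path -> translate ~path ~source:(read_file path))
  |> String.concat ""

let () =
  (* Skip the program name and ignore anything that is not an .ml file.
     Output goes to stdout, which the dune rule redirects into the target. *)
  match Array.to_list Sys.argv with
  | _ :: args ->
    let paths = List.filter (fun p -> Filename.check_suffix p ".ml") args in
    print_string (generate paths)
  | [] -> ()
```

Keeping generate separate from the argv handling makes it easy to test the translation without going through dune at all.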

Let me just drop a couple of links with some interesting perspectives that may be of interest:

Thank you for clarifying this. I completely misunderstood this point.

I’m looking at examples/attributes/lib/examples_attributes_lib.ml in mooreryan/ocaml_python_bindgen on GitHub (at commit e17762a03d8726ecdb92f3e31ae9f6ab3687e8c5), and the type t is a PythonObject.

===

The focus here seems to be: “generate OCaml bindings to an existing Python library”. This might end up not being as close to what I want.

I’m after things like this:

module MyRecord = struct
  type t = {
    a: int;
    b: string;
    c: int
  }
end

module MyVariant = struct
  type t =
    | Ok of string * int
    | Err of int * int
end

and then, on a successful build, having it generate blah.rs of

struct MyRecord {
  a: i32,
  b: String,
  c: i32
}

enum MyVariant {
  Ok(String, i32),
  Err(i32, i32)
}
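To make the target concrete, here is a minimal sketch of the "OCaml description in, Rust text out" step, using only the standard library. Everything in it is hypothetical (the ty and decl types, rust_ty, rust_decl); it is not taken from pyml_bindgen or any existing tool. It also assumes OCaml int maps to i32 to match the output above, even though OCaml's int is 63 bits on 64-bit platforms, so i64 may be the safer choice.

```ocaml
(* A deliberately tiny model of the OCaml types we want to export. *)
type ty =
  | T_int
  | T_string
  | T_named of string (* reference to another generated type *)

type decl =
  | Record of string * (string * ty) list       (* name, fields *)
  | Variant of string * (string * ty list) list (* name, constructors *)

(* Assumed mapping from OCaml types to Rust types (int -> i32 is a choice). *)
let rust_ty = function
  | T_int -> "i32"
  | T_string -> "String"
  | T_named n -> n

(* Render one declaration as a Rust struct or enum. *)
let rust_decl = function
  | Record (name, fields) ->
    let field (f, t) = Printf.sprintf "  %s: %s,\n" f (rust_ty t) in
    Printf.sprintf "struct %s {\n%s}\n" name
      (String.concat "" (List.map field fields))
  | Variant (name, ctors) ->
    let ctor (c, args) =
      match args with
      | [] -> Printf.sprintf "  %s,\n" c
      | _ ->
        Printf.sprintf "  %s(%s),\n" c
          (String.concat ", " (List.map rust_ty args))
    in
    Printf.sprintf "enum %s {\n%s}\n" name
      (String.concat "" (List.map ctor ctors))

let () =
  print_string
    (rust_decl (Record ("MyRecord", [ "a", T_int; "b", T_string; "c", T_int ])));
  print_string
    (rust_decl
       (Variant ("MyVariant", [ "Ok", [ T_string; T_int ]; "Err", [ T_int; T_int ] ])))
```

The hard part of the problem is producing those decl values from real .ml files; once they exist, the printing side stays about this small.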

===

The two gaps I see from what exists and what I need are:

  1. the input file is a txt, not *.ml (this is not a big deal)

  2. it appears the focus in your case is binding existing python functions; whereas my case is focused on reading OCaml structs

I’m not sure if point (2) is correct, though looking through find . | grep txt, all the files, with the exception of lib_ml.txt, seem to be focused on binding functions, not structs.

Please let me know if I have misunderstood anything here.

EDIT: uppercase → bold, to avoid it looking like shouting

So the file you linked (this one) is the code generated by the pyml_bindgen program.

^ That’s correct. It generates the bindings by reading ocaml value specifications. Note that I just put them as .txt files, but really it is syntax that would go in an mli file or in the module signature.

You could take your example (by the way, I’m assuming this file would be part of your normal ml code files, right?) and add attributes to the code. Like this:

module MyRecord = struct
  type t = {
    a: int;
    b: string;
    c: int
  }
  [@@snazzy_rust magic]
end

module MyVariant = struct
  type t =
    | Ok of string * int
    | Err of int * int
  [@@snazzy_rust sparkles]
end

The attributes will be ignored when compiling your ml files, but your codegen can use them to mark things you want. Then it’s a matter of parsing the types into AST/internal format using your favorite parser, or even compiler libs if you don’t mind some instability. From there you can translate your ocaml ast to rust ast or code and print it out.

^ It is true that I’m generating bindings to existing python packages/functions/things, but I’m doing it by parsing ocaml code (value specifications), then generating ocaml code (modules that use pyml) based on the value specifications. You’re right, it’s generating bindings to python functions/methods and then using abstract types or pyobjects, but it still parses ml(i) “text” stuff to do it, and not even in a fancy way, just using a custom stripped down angstrom parser rather than hooking in to ppx or compiler libs (whether that was the right choice or not is a different matter).

Edit: tl;dr is that while the specifics of the pyml_bindgen code itself may not be all that useful to you, the general approach may be: read & parse ml => generate rust and use dune rules to automate it all.

Edit2: in case it’s not clear, pyml_bindgen also uses custom attributes to guide/customize the codegen.

Edit3: value specifications in the ocaml manual

  1. Thank you for your patience in answering all these questions.

  2. I think this is the core of my misunderstanding.

I incorrectly thought you had this framework already in place for OCaml tuple/variant/record → Python tuple/variant/record, and the only thing I had to do was to find the code that generates *.py and change it to generate *.rs.

  3. However, in actuality, this “OCaml tuple/variant/record → Python tuple/variant/record” code does not exist in the repo because the repo is focused on solving a different problem – generating OCaml bindings to existing Python functions.

^-- Is the above correct? If so, this would clear up all the misunderstandings.

No worries, hopefully it’s helpful or can save you unnecessary time/work.

That is correct. What does exist is code to go from those OCaml value specifications that describe how you want to use the function in OCaml world and how to generate OCaml code that uses pyml to give you a correct binding that reflects that. So probably not code you could lift directly.

It could be the case that my original post mentioning pyml_bindgen gave you the wrong idea. The reason I brought up pyml_bindgen was that the “big idea” is sort of similar to what you want to do, read ml and generate something. And it has (imo) decently commented dune files with examples of rules showing how to automate the process with dune that may prove helpful if you decide to go with a codegen approach.

Thanks! I think I figured out where I went wrong. Is the following correct:

  1. We have foo.py. We want foo.ml (for calling foo.py).

  2. We write a foo.txt (OCaml val sigs).

  3. We run your generator.

  4. It generates foo.ml

===

  5. In short, your lib is:

Input = foo.txt (OCaml val sig)
Output = foo.ml (OCaml code)

  6. What I want is: input = foo.ml, output = foo.rs

  7. However, if I wanted to use your framework, I could do:

input = foo.txt (OCaml type sig)
output = foo.ml (just cp file) and foo.rs (do some generating)

====

  8. What parts of your code do I need to modify?

8a. The *.txt file, instead of being parsed for value-sig, is now parsed for type-sig.

8b. The *.txt file, instead of generating foo.ml, now generates foo.rs.

====

That’s about it, if I want to build on the existing structure, right?

What you describe sounds like what I would try (at least as a first cut).

By the way are you wanting generation from type signatures that exist in your other ocaml source files or will you have a specific little ml-like file that doesn’t necessarily even have to be valid ocaml? (Just curious, I think the approach is similar either way.)

I would probably approach it something like this, yeah. Details may depend on if you need to generate rust from existing ml source files, or if you are writing specific files containing types in ml syntax, and generating rust from those. But yeah, to me that sort of codegen approach sounds reasonable.

You would have to modify a lot to use code from pyml_bindgen, I would think. It uses a bespoke parser for the subset of value specifications that it can generate code for, so there isn’t any interesting ppxlib or compiler-libs stuff for parsing OCaml code or dealing with the OCaml AST that you could reuse for your operation. The task was simple enough that the custom parser was enough for my needs, and that parser is only for the val specs (which isn’t what you need anyway).

Now, if the types you expect to handle are simple enough, you could probably do the same and whip up a tiny parser & ast for the subset you care about, but if it gets more complicated you may want to consider the tools in compiler libs (which aren’t exactly stable as far as I know) to deal with ocaml parsing and ast manipulation if you need it.

So yeah, the simplest option is that you only need to parse a small subset of all possible OCaml types, and you don’t even need those to be part of an ml file you’re actually compiling. Then just write a small parser from ml to AST, and go from there to Rust in a similar way to what pyml_bindgen does (again, not necessarily using the code in that project directly, but taking inspiration from the workflow, if that makes sense). More complicated is if you want to take the ml files in your real project, annotate them with attributes, and use a fancier parser/AST lib to deal with them, then generate Rust code from that. A generator approach would still work if you wanted to tackle it. If you want something more complicated than one of those two options, then it’s possible you may need a completely different approach.
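As a rough illustration of how small the "simplest option" can be, here is a sketch of a parser for flat record declarations. It uses plain string splitting from the standard library rather than angstrom or menhir, and the names record_body and parse_fields are made up for this example. It assumes every field has the form name : type, with no nested records, comments, mutable fields, or attributes between fields.

```ocaml
(* Extract the text between the first '{' and the last '}' of a declaration.
   Raises Not_found if either brace is missing. *)
let record_body (decl : string) : string =
  let l = String.index decl '{' in
  let r = String.rindex decl '}' in
  String.sub decl (l + 1) (r - l - 1)

(* Parse "a: int; b: string" into [("a", "int"); ("b", "string")].
   Only the flat `name : type` subset is handled. *)
let parse_fields (body : string) : (string * string) list =
  body
  |> String.split_on_char ';'
  |> List.map String.trim
  |> List.filter (fun s -> s <> "")
  |> List.map (fun field ->
         match String.index_opt field ':' with
         | Some i ->
           let name = String.trim (String.sub field 0 i) in
           let ty =
             String.trim
               (String.sub field (i + 1) (String.length field - i - 1))
           in
           (name, ty)
         | None -> failwith ("expected `name : type`, got: " ^ field))

let () =
  "type t = { a: int; b: string; c: int }"
  |> record_body
  |> parse_fields
  |> List.iter (fun (name, ty) -> Printf.printf "%s has type %s\n" name ty)
```

Anything beyond this subset (type parameters, nested modules, attributes with payloads) is where a real parser or the compiler's own parser starts to pay for itself.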

I’m going to sleep on this and hopefully dream of a solution.

Thanks again for all your help / patience through this – absolutely helped me work from a vague cloud of ideas to concrete things to try. Cheers!

Instead of defining a new grammar, I’m going to go with plain OCaml. This way, I can avoid defining my own treesitter / LSP.

Here’s what I have so far:

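module Type_Id = struct
  module T = struct
    type t = string

    let compare (x : t) (y : t) = String.compare x y

    (* Render the identifier as an atom so maps keyed by it print usefully. *)
    let sexp_of_t (x : t) : Base.Sexp.t = Base.Sexp.Atom x
  end

  include T

  (* Provides comparator_witness, which Base.Map.t needs as its third parameter. *)
  include Base.Comparator.Make (T)
end


module Field_Name = struct
  type t = string
end

module OCaml_Type = struct
  type t =
    | Prim of Type_Id.t
    | Record of (Field_Name.t * Type_Id.t) array
    | Variant of (Field_Name.t * t) array
    | Tuple of Type_Id.t array
end


module OCaml_Module = struct
  type t = (Type_Id.t, OCaml_Type.t, Type_Id.comparator_witness) Base.Map.t
end


(* Stub: will eventually render every registered type as Rust source. *)
let write_rust (_m : OCaml_Module.t) : string = ""

let global_register : (Type_Id.t * OCaml_Type.t) list ref = ref []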

Here is the general idea. I can use a @@deriving my_rust for each type I care about, which generates a

let () = ... adds self to global_register list ...

This way, on startup, everything should be added to global_register. Then we call the Rust generator on it.
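Here is a minimal, stdlib-only sketch of that registration pattern, with the let () lines written by hand where the deriver would generate them. The my_rust deriver does not exist yet, so the register helper, the simplified ocaml_type, and the exact shape of the generated code are all assumptions.

```ocaml
(* Simplified stand-in for OCaml_Type.t, with types kept as plain strings. *)
type ocaml_type =
  | Record of (string * string) list
  | Variant of (string * string list) list

(* Global registry, filled in by module initialisers at program start. *)
let global_register : (string * ocaml_type) list ref = ref []

let register (name : string) (ty : ocaml_type) : unit =
  global_register := (name, ty) :: !global_register

module My_record = struct
  type t = { a : int; b : string; c : int }

  (* What a hypothetical [@@deriving my_rust] might generate for t: *)
  let () =
    register "My_record" (Record [ "a", "int"; "b", "string"; "c", "int" ])
end

module My_variant = struct
  type t = Ok of string * int | Err of int * int

  (* Likewise generated by the hypothetical deriver: *)
  let () =
    register "My_variant"
      (Variant [ "Ok", [ "string"; "int" ]; "Err", [ "int"; "int" ] ])
end

(* Entries are consed on in initialisation order, so reverse the list
   to recover definition order. *)
let registered () : (string * ocaml_type) list = List.rev !global_register

let () =
  List.iter (fun (name, _) -> print_endline ("registered: " ^ name)) (registered ())
```

One caveat with this design: if the tagged modules live in a library and nothing else references them, the linker may drop them and their initialisers never run, so the registry can come up short unless the modules are referenced or the executable is linked with -linkall.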

My main obstacle right now is, in:

module Foo = struct
  type t = { ... } [@@deriving my_rust]

can the deriver get the name “Foo” (of the enclosing module, which is defined outside the t node)?

Update: Solved in the zeroexcuses/learn_camlp5 repository on GitHub (actual work by @Chet_Murthy; typos my fault).