It was a good excuse to experiment with non-dune build systems (to scope things out). I went for a plain Makefile in the end which works well.
I also wanted to figure out a better way to embed data in an executable. I ended up wondering about moving as much of the processing as possible into the build phase. What I ended up with is a small program which prints a compilation unit (.ml) which has mostly array literals. Still have some open questions on that, any input welcome:
Should I have used meta-ocaml to print the code? The data/munch.ml would probably be more readable, but the build probably less.
How could I generate this kind of processed-data code for data-structures which don’t have a literal (maps, sets, hash tables, etc.)? How can I minimise the initialisation cost of the program for such situations?
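For readers unfamiliar with the approach, the "program which prints a compilation unit" idea can be sketched roughly as follows. This is only an illustration, not the actual data/munch.ml: the function name `array_literal`, the binding name `words`, and the sample data are all made up here.

```ocaml
(* Hypothetical build-time generator: renders a string list as the source
   text of an OCaml array literal bound to [name]. *)
let array_literal (name : string) (items : string list) : string =
  let b = Buffer.create 256 in
  Printf.bprintf b "let %s = [|\n" name;
  (* %S prints each item as a properly escaped OCaml string literal. *)
  List.iter (fun s -> Printf.bprintf b "  %S;\n" s) items;
  Buffer.add_string b "|]\n";
  Buffer.contents b

let () =
  (* Stand-in for the real processing step: sort and deduplicate. *)
  let words = List.sort_uniq String.compare [ "pear"; "apple"; "fig"; "apple" ] in
  print_string (array_literal "words" words)
```

A Makefile rule would then redirect the generator's standard output into a .ml file that gets compiled with the rest of the program.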
At some point I investigated using malfunction as an alternative implementation to crunch (which I assume you are aware of, hence your munch.ml) because I had the perception that parsing large OCaml source files with huge data literals was slow. I made a small proof of concept that, if I remember correctly, read data from a hard-coded file. I got stuck on figuring out how to make dune use this. For a Makefile setup this wouldn’t be a problem, I think. Unfortunately, I’m afraid the code only lives on the drive of my laptop that has since died. Anyway, I think an approach with Malfunction could be interesting.
I have used crunch in my previous random generator but I was unsatisfied with having the fake I/O and the parsing (even though it’s just some line splitting).
I was more focused on the cost of initialisation during the execution of the binary (e.g., avoiding a Hashtbl.of_seq (Array.to_seq <big-literal-here>) which does a whole lot of hashing, the same hashing every time as well). I hadn’t considered compilation time. That’s an interesting consideration and maybe I’ll use that next?
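One possible way to sidestep the start-up hashing, offered purely as a sketch and not something proposed elsewhere in this thread: have the generator emit the key/value pairs already sorted by key, and do a binary search at run time. The `table` contents and the name `find_opt` below are hypothetical stand-ins for generated code.

```ocaml
(* Stand-in for a generated literal; assumed to be sorted by key
   (with String.compare) at build time. *)
let table : (string * int) array =
  [| ("apple", 3); ("fig", 7); ("pear", 1) |]

(* Binary search over the sorted keys: no hashing, no work at start-up. *)
let find_opt (key : string) : int option =
  let rec go lo hi =
    if lo >= hi then None
    else
      let mid = lo + ((hi - lo) / 2) in
      let k, v = table.(mid) in
      let c = String.compare key k in
      if c = 0 then Some v
      else if c < 0 then go lo mid
      else go (mid + 1) hi
  in
  go 0 (Array.length table)
```

The trade-off is logarithmic lookups instead of the expected constant time of a hash table, so whether this is a win depends on how many lookups a typical run performs.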
I’ve considered using marshaled data in a file but it still requires reading a file. And also having an error-recovery mechanism for when the file is not present (paths having changed, XDG variables having been modified, etc.)
I think it’d be useful to have a small library that deals with the error-recovery: load the file if present and if it works, otherwise regenerate the file and rewrite it to disk. Do you know if anything like that exists?
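To make the idea concrete, here is a minimal sketch of what such a helper might look like using only the standard library. The name `load_or_regenerate` and its signature are invented for illustration; I am not claiming an existing library has this API.

```ocaml
(* Hypothetical helper: try to unmarshal a cached value from [path];
   if the file is missing or unreadable, call [regenerate] and try to
   write the result back to disk.
   Caveat: Marshal is not type-safe, so a file containing a valid value
   of a *different* type is not detected here. *)
let load_or_regenerate (path : string) (regenerate : unit -> 'a) : 'a =
  let read () =
    let ic = open_in_bin path in
    Fun.protect
      ~finally:(fun () -> close_in_noerr ic)
      (fun () -> (Marshal.from_channel ic : 'a))
  in
  let write v =
    let oc = open_out_bin path in
    Fun.protect
      ~finally:(fun () -> close_out_noerr oc)
      (fun () -> Marshal.to_channel oc v [])
  in
  match read () with
  | v -> v
  | exception (Sys_error _ | End_of_file | Failure _) ->
    let v = regenerate () in
    (* Best effort: a failed write should not prevent returning the value. *)
    (try write v with Sys_error _ -> ());
    v
```

A more careful version would probably store a version tag or checksum next to the data so that a stale or foreign file gets rejected rather than silently misread.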
You can also marshal into a string, and generate a .ml file which defines this string.
At runtime you unmarshal from the string, which will be part of (the static data of) your executable.
There’s a bit of work to make sure the string is escaped properly, but I think Printf.printf "%S" will work (you can also use the {foo|... |foo} syntax if you can check in advance for occurrences of |foo} in the string).
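A rough sketch of both halves, assuming the generated binding is called `data` (the names `emit_marshalled` and `load` are made up for this example):

```ocaml
(* Build time: render [let <name> = "<marshalled bytes>"] as OCaml source.
   %S takes care of escaping quotes, backslashes and non-printable bytes. *)
let emit_marshalled (name : string) v : string =
  Printf.sprintf "let %s = %S\n" name (Marshal.to_string v [])

(* Run time: unmarshal from the string embedded in the executable.
   The result type is unchecked, so annotate it at the use site. *)
let load (blob : string) = Marshal.from_string blob 0

let () =
  (* What the generator would print into the generated .ml file. *)
  print_string (emit_marshalled "data" [ ("apple", 3); ("pear", 1) ])
```

In the real program the run-time side would be something like `let table : (string * int) list = load Generated.data`, where `Generated` is whatever module name the generated file ends up with.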
I don’t really like "%S": it uses decimal escapes and produces one long line.
If you can afford a bit more code, the stdlib-based code here makes a let for you, with hex escapes and restricted to 80 columns (not sure why it tries to compute the length of the buffer so precisely; perhaps this wanted to use a bytes directly).
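For illustration only, and not the linked code: one way to get hex escapes wrapped at a fixed width is to escape every byte as `\xHH` and break lines with a trailing backslash, which OCaml string literals treat as a line continuation (the newline and the leading blanks of the next line are skipped). The name `hex_literal` and the `?width` parameter are assumptions of this sketch.

```ocaml
(* Hypothetical printer: renders [s] as [let <name> = "..."] where every
   byte is a \xHH escape and no output line exceeds [width] columns. *)
let hex_literal ?(width = 80) (name : string) (s : string) : string =
  let b = Buffer.create ((4 * String.length s) + 64) in
  Printf.bprintf b "let %s =\n  \"" name;
  let col = ref 3 in
  String.iter
    (fun c ->
      (* Keep one column free for the trailing backslash or closing quote. *)
      if !col + 4 >= width then begin
        Buffer.add_string b "\\\n   ";
        col := 3
      end;
      Printf.bprintf b "\\x%02x" (Char.code c);
      col := !col + 4)
    s;
  Buffer.add_string b "\"\n";
  Buffer.contents b
```

Escaping every byte makes the source about four times the size of the data, so a real implementation might leave printable characters as they are.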
The marshal suggestions are nice. I’m now thinking of a ppx which turns
let foo[@marshalled] = <some expression>
into
let foo = Marshal.from_string "<%S escaped expression>" 0
It wouldn’t do anything within dune’s dev profile (to get a faster feedback loop), and would only be active for the release profile. I think it can work with some dune ocaml top-module <file-name>. I’ll give it a go at some point.
Sorry, I think my personal anecdote was confusing. I didn’t mean you should necessarily re-implement crunch, too. Instead, you could use malfunction to basically build a highly specialized compiler
let csv = [%blob "../resource/something.csv"] in
frobnicate csv
and that is all it takes to embed the contents of a file. At compile time. So if you change the resource, you had better delete the *.cm* files so as to force an update.
That looks nice! It doesn’t solve the issue of parsing/initialisation but it does offer a simple mechanism for including a marshalled blob. Thanks for the pointer.