Generating a map at compile time?

Oh dear. It’s not a secret, it’s a table of the standard library functions in a DSL compiler that I maintain. I didn’t mention specifics in the OP because 1) I think this is a problem of general interest beyond my specific case and 2) I am weary of hearing suggestions for alternative implementations that could derail the actual curiosity at hand. Suffice it to say that we already know that this design (essentially generating combinatorially many options for different sequences of input type to overloaded functions) is a local maximum of the design space, for reasons that have nothing to do with the speed concern that prompted me to start this thread.

More salient to the rest of your comment, the only strings are the keys (names of functions), which will be unique, so I don’t know if interning would end up mattering much. The values store custom types. The immutable array type is interesting (for this and for other things I’m sure).

My current status with this thread is that I actually think the performance I can achieve with just the ideas already in this thread is very solidly in the “good enough” camp – further developments would be mostly interesting to me if they make the developer story nicer in terms of things like type safety, easier to build, etc, etc.

Funny you mention, one of the reasons that this same compiler doesn’t support unicode identifiers is because I failed a couple attempts to get the uu-libraries to install in the docker emulation we use to build non-x86 binaries… This is probably another reason to not try the “torture ocamlopt” path over Marshal

Also I found this:

and that

a bit difficult to engage with. Don’t believe, measure. I think this thread gave many unreasonable proposals but without having the root cause precisely pinned it’s a bit difficult to give good advice.

No issue. No problem. Please do report these problems.

Certainly agree. I have been profiling throughout these experiments to have evidence to justify the change to the rest of the team if I propose it. There is definitely some question over what exactly to use as profiling input. If I chose small programs (or dump things like just the help output), it would easily reach 60% of runtime, but on large examples (at the upper limits of what I would see in the wild), it was closer to 8-10%. That’s still quite a bit for something that is statically knowable at compile time, I’d say.

The wishy-washy nature of the second post you quoted was just because the structure of the code currently calculates and inserts at the same time, which made the profiler output not clear to the level I would have hoped. The different code changes I’ve tried in response to this thread have actually made it much clearer, which is a benefit in of itself, and the cost of allocating the map is indeed tiny compared to the computing of the entries

As far as I could tell, my issue was a duplicate of the several closed Uunf issues related to stack size on these platforms. But the suggestion of using ulimit wasn’t working for me (due to some specifics of our docker set up at the time, IIRC). There were other reasons the feature was questionable at the time, so I didn’t pursue it further. The CI environment we use has evolved a bit since I’ve last tried, so it may just work next time

1 Like

These should likely be solved by using OCaml 5 now that stack frames are heap allocated (but it could also simply convert to an OOM in the container).

1 Like

I’ve been further exploring the solution using Marshal. Interestingly, there seems to be a decent performance regression with that solution on OCaml 5.3 versus 4.14. landmarks reports that un-Marshaling the alist and calling from_alist_exn takes 22.30M cycles in 5.3, compared to only 7.55M cycles in 4.14 (this is still faster than not doing this and just building the list at runtime, which clocks in at 63.78M cycles). This seems to be the same regardless of any tricks suggested here in terms of pre-let-defining the strings or similar.

Is this something that is previously known that I can read more about?

Possibly OCaml 5 performance regression for unmarshal-heavy workloads · Issue #13300 · ocaml/ocaml · GitHub

1 Like

Thanks, that seems to almost certainly be it. I was about to post a synthetic example where this could be observed and it was very nearly an exact re-creation of this comment.

This might be unhelpful, but my initial thought was to use a high-speed key-value store like LMDB and marshal the stored values, and putting that behind an in-memory cache (i.e. a hash table), so you only pay to build the hash table for values you actually use at runtime.

Maybe you’ve already considered this approach and it’s not a fit for your problem. It’s just what I thought of.

1 Like

If you’re looking for a solution that would remain compatible with every ocaml version, there would be the solution to print code. Just printing the data as ocaml code printf "[| ... etc", to make it possible you can just copy the hashtbl or the map module locally in your project, and un-hide the type t in the interface.
And if you want to use a Map instead of an Hashtbl, and if marshalling the map is not possible because of the fact that it contains a function pointer, you can also copy this module locally in your local project, and modify it. It’s quite easy to change it without a functor, the compare function can be passed as a parameter instead, and you get an interface similar than Hashtbl, and you can also un-hide the type t, so that you can print the code of your data. And if the data is big, you could print it in its own module, and compile it only once, as a .cmx object.

That’s, cool, I didn’t know that Stan uses OCaml to compile! Do you know of plans for Stan bindings to make it directly usable in OCaml as well?

[EDIT] it looks like the newer bindings in other languages also go through the cmdline executable, but expose the results as values. So it’s ‘only’ the wrapping of the output that’s missing. relevant thread.

1 Like

Yes, if you wanted to build a CmdStanOCaml the main difficulties are cmdstan’s baroque command line argument structure (which wouldn’t be too hard to represent with some sort of builder pattern I think) and then processing the output. The latter might suffer from the general lack of n-d array libraries.

We’re always happy to pick up more people interested in both Stan and OCaml, though! The compiler is a fun project that could definitely use more eyes than just mine and a couple of the ocaml-familiar-enough developers who review


If people are curious, here’s the approach I ended up opening a PR for and that we will likely be using. It’s essentially the dune + marshal combination discussed above. I also included some of the benchmarking data in the PR comments – the tl;dr is that it ends up being 20-40% faster in OCaml 4.14, and ~5-30% faster in OCaml 5.3 (due the the un-Marshaling speed regression linked above)

2 Likes