We have recently presented the Duplo Post-Link optimiser, which among other things manages to
compile OCaml to machine code through the code generator of LLVM on amd64.
I built a histogram comparing the code generated by ocamlopt with the code generated by the LLVM backend. No optimisations were enabled at the LLIR level, that is, on the intermediate representation used by the Duplo framework.
The histogram counts the occurrences of each instruction in ocamlopt. I have also included some
comments to highlight which instruction choices are likely to be better or more compact. While one of the issues has since been addressed, the data does highlight a few other opportunities for improvement.
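For anyone who wants to build a similar histogram for their own binaries, here is a minimal sketch in Python. It is not the script used to produce the histogram discussed here. It assumes the input is text in the layout printed by `objdump -d` (address, tab, hex bytes, tab, mnemonic), and the names `count_mnemonics` and `compare` are made up for illustration. Prefixes such as `lock` or `rep` are counted as their own mnemonic, which is a simplification.

```python
import re
from collections import Counter

# Matches one instruction line of `objdump -d` style output, e.g.
#   "  401000:\t48 89 e5             \tmov    %rsp,%rbp"
# Assumption: address, colon, tab, hex bytes, tab, then the mnemonic.
INSN_RE = re.compile(r"^\s*[0-9a-f]+:\t(?:[0-9a-f]{2} )+\s*\t(\S+)")


def count_mnemonics(disassembly: str) -> Counter:
    """Count how often each mnemonic appears in a disassembly listing."""
    counts = Counter()
    for line in disassembly.splitlines():
        match = INSN_RE.match(line)
        if match:  # skips labels, blank lines and byte-only continuation lines
            counts[match.group(1)] += 1
    return counts


def compare(a: Counter, b: Counter) -> list:
    """Return (mnemonic, count_in_a, count_in_b) rows, most frequent first."""
    mnemonics = set(a) | set(b)
    rows = [(m, a.get(m, 0), b.get(m, 0)) for m in mnemonics]
    # Sort by combined frequency (descending), then by name for stable output.
    rows.sort(key=lambda row: (-(row[1] + row[2]), row[0]))
    return rows
```

Running `count_mnemonics` on the disassembly of each build and passing both results to `compare` gives one row per instruction, which is enough to spot where the two code generators make different choices.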
We do not compile to LLVM IR, but to a representation we call LLIR that keeps track of the types of virtual registers in OCaml - int, float, value or address. The IR is then lowered to LLVM’s MachineIR through the SelectionDAG instruction selector. A custom instruction was added to handle GC metadata and some passes were modified to preserve the semantics of GC roots.
I missed the NY Q&A time for this paper.
Was there an answer to whether this will work with Multicore OCaml?
And how can someone try this out, is there an opam switch for it?
There are instructions on setting up a pin here.
Unfortunately, some packages which rely on amd64 inline assembly or do not query the ocaml environment for the right C compiler to use (CC=llir-gcc) will not install at the moment.
We have the required diffs/versions in our version of the sandmark benchmark suite, and we will create a repository out of them.
Thanks for all the data and the suggestions. As you mention, a couple of them (32-bit immediate load into 64-bit register; no “call” instruction to set up trap handlers) are already implemented in OCaml 4.11.
The ocamlopt back-end was initially designed for RISC processors, with load instructions separate from computational instructions, so things like “callq (reg64)” just don’t fit.