Ocamlopt vs Duplo/LLVM code generation

We have recently presented the Duplo Post-Link optimiser, which among other things manages to
compile OCaml to machine code through the code generator of LLVM on amd64.

I built a histogram comparing the code generated by ocamlopt with the code generated by the LLVM backend, with no optimisations enabled at the level of LLIR, the intermediate representation used by the Duplo framework.

The histogram counts the occurrences of each instruction in ocamlopt. I have also included some
comments to highlight which instruction choices are likely to be better or more compact. While one of the issues has since been addressed, the data does highlight a few other opportunities for improvement.
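
For anyone who wants to build a similar histogram, here is a minimal sketch of the counting step. It is not the script used for the data above: the function name `instruction_histogram` and the `SAMPLE` listing are made up for illustration, and it assumes AT&T-syntax assembly with one instruction per line, `#` comments, and labels on their own lines.

```python
from collections import Counter


def instruction_histogram(asm_text):
    """Count how often each mnemonic appears in an assembly listing.

    Assumption: one instruction per line, AT&T syntax, '#' starts a comment.
    """
    counts = Counter()
    for raw in asm_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.endswith(":") or line.startswith("."):
            continue  # blank line, label, or assembler directive
        counts[line.split()[0]] += 1  # first token is the mnemonic
    return counts


# Hypothetical listing, only here to exercise the function.
SAMPLE = """\
camlExample__f_1:
        .cfi_startproc
        movq    %rax, %rbx
        addq    $2, %rbx        # comment after an instruction
        movq    %rbx, %rax
        ret
"""

if __name__ == "__main__":
    for mnemonic, count in instruction_histogram(SAMPLE).most_common():
        print(f"{mnemonic:8} {count}")
```

Running the same function over the assembly emitted by each backend and comparing the two counters gives the kind of side-by-side view described above.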


Could you explain the LLVM path in a little more detail? How are you compiling via LLVM?

There is a nice ICFP paper on this: https://dl.acm.org/doi/10.1145/3408980


We do not compile to LLVM IR, but to a representation we call LLIR that keeps track of the types of virtual registers in OCaml - int, float, value or address. The IR is then lowered to LLVM’s MachineIR through the SelectionDAG instruction selector. A custom instruction was added to handle GC metadata and some passes were modified to preserve the semantics of GC roots.
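
To make the idea of typed virtual registers a bit more concrete, here is a rough sketch. It is not how LLIR is implemented: `RegType`, `VReg` and `gc_roots` are hypothetical names, and it assumes that only `value`-typed registers need to be reported to the GC, in the spirit of the `Val` type in OCaml's Cmm.

```python
from dataclasses import dataclass
from enum import Enum


class RegType(Enum):
    """The four register types mentioned above."""
    INT = "int"
    FLOAT = "float"
    VALUE = "value"      # assumed: an OCaml value the GC must know about
    ADDRESS = "address"  # assumed: a raw address, not tracked as a root


@dataclass(frozen=True)
class VReg:
    """A virtual register tagged with its type (hypothetical layout)."""
    index: int
    type: RegType


def gc_roots(live_regs):
    """Registers that would be recorded as GC roots at a call site,
    under the assumption that only VALUE registers are roots."""
    return [r for r in live_regs if r.type is RegType.VALUE]


live = [VReg(0, RegType.INT), VReg(1, RegType.VALUE), VReg(2, RegType.ADDRESS)]
roots = gc_roots(live)
```

The point of carrying the type on every register is that information like this stays available all the way down to the stage where the GC metadata is emitted.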


I missed the NY QA time for this paper.
Was there an answer on whether this will work with Multicore OCaml?
And how can someone try this out, is there an opam switch for it?


They mention it in the paper. There’s nothing preventing multicore from working, but the code hasn’t been adapted to it.


Thanks, I’ve not fully read the paper.

There are instructions on setting up a pin here.
Unfortunately, some packages that rely on amd64 inline assembly, or that do not query the OCaml environment for the right C compiler to use (CC=llir-gcc), will not install at the moment.
We have the required diffs/versions in our version of the sandmark benchmark suite and we will create a repository out of them.


Thanks for all the data and the suggestions. As you mention, a couple of them (32-bit immediate load into 64-bit register; no “call” instruction to set up trap handlers) are already implemented in OCaml 4.11.
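
As a worked example of why the immediate-load change matters for code size, here is a small sketch of the standard amd64 encoding sizes. It assumes the point refers to the usual trick of writing the 32-bit sub-register, which zero-extends into the full 64-bit register; the helper name `mov_imm_encoding` is made up, and this is not code from either compiler.

```python
def mov_imm_encoding(imm, high_reg=False):
    """Return (mnemonic, size_in_bytes) of the shortest amd64 encoding that
    loads the 64-bit constant `imm` into a general-purpose register.

    high_reg=True stands for r8..r15, which need a REX prefix even for
    32-bit moves.
    """
    if not -(1 << 63) <= imm < (1 << 64):
        raise ValueError("constant does not fit in 64 bits")
    bits = imm & ((1 << 64) - 1)  # view the constant as an unsigned 64-bit pattern
    if bits < (1 << 32):
        # movl $imm32, %r32: opcode B8+r plus 4 immediate bytes.
        # Writing a 32-bit register zeroes the upper 32 bits.
        return ("movl", 5 + (1 if high_reg else 0))
    if bits >= (1 << 64) - (1 << 31):
        # movq $imm32, %r64: REX.W C7 /0 plus a sign-extended imm32.
        return ("movq", 7)
    # movabsq $imm64, %r64: REX.W B8+r plus 8 immediate bytes.
    return ("movabsq", 10)
```

For a small non-negative constant such as 1, this gives 5 bytes for `movl` against 7 bytes for the sign-extended `movq` form, so every such load saves 2 bytes.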

The ocamlopt back-end was initially designed for RISC processors, with load instructions separate from computational instructions, so things like “callq (reg64)” just don’t fit.


@nandor, I have a package which uses AVX intrinsics. While compiling, the error I get is -

Do not know how to split the result of this operator

With the invocation being -

clang-12 -cc1 -triple x86_64-unknown-linux-gnu -S -disable-free -disable-llvm-verifier -discard-value-names -main-file-name bigstringaf_simd_avx2.c -mrelocation-model pic -pic-level 2 -mframe-pointer=none -relaxed-aliasing -fmath-errno -fno-rounding-math -no-integrated-as -mconstructor-aliases -target-cpu x86-64 -target-feature +avx2 -tune-cpu generic -fno-split-dwarf-inlining -debug-info-kind=limited -dwarf-version=4 -debugger-tuning=gdb -v -resource-dir /home/anmol/.opam/llir/llvm/lib/clang/12.0.0 -D _FILE_OFFSET_BITS=64 -D _REENTRANT -I /home/anmol/.opam/llir/lib/ocaml -mllvm -llir -llir -isysroot /home/anmol/.opam/llir -internal-externc-isystem /home/anmol/.opam/llir/include -internal-isystem /home/anmol/.opam/llir/llvm/lib/clang/12.0.0/include -O2 -Wall -Wextra -Wpedantic -fno-dwarf-directory-asm -fdebug-compilation-dir /home/anmol/bigstringaf/_build/default/lib -ferror-limit 19 -fwrapv -fgnuc-version=4.2.1 -fcolor-diagnostics -vectorize-loops -vectorize-slp -o /tmp/bigstringaf_simd_avx2-ff6282.s -x c bigstringaf_simd_avx2.c

I assume this is because lowering for these intrinsics has not been implemented yet. Any thoughts on this?

Unfortunately the C-to-LLIR lowering in Clang does not yet support SSE intrinsics.

Oh okay. Makes sense. Thank you! Really enjoyed reading the paper and it's definitely pretty cool work.