Ocamlopt vs Duplo/LLVM code generation

nand · August 26, 2020, 9:29pm

We have recently presented the Duplo Post-Link optimiser, which among other things manages to
compile OCaml to machine code through the code generator of LLVM on amd64.

I built a histogram comparing code generated by ocamlopt and the LLVM backend, without enabling any optimisations at the LLIR level, on the intermediate representation used by the Duplo framework.

https://github.com/nandor/ocaml-llir-comparison/blob/master/results.txt

INSTRUCTION                                OCAML    LLIR

cmp IMM,REG_8                                  0      28
cmp IMM,REG_32                                 0       1
cmp REG_64,IMM(REG_64)                         0     267
cmp REG_64,IMM(REG_64,REG_64,IMM)              0       3
cmp REG_64,(REG_64)                            0     307
cmp (REG_64),REG_64                            0      90
cmp IMM(REG_64,REG_64,IMM),REG_64              0       1
cmp IMM(REG_64),REG_64                         0     357
cmpq IMM,IMM(REG_64,REG_64,IMM)                0      11
cmpq IMM,(REG_64)                              0     780
cmpq IMM,IMM(REG_64)                           1    1681
cmpb IMM,IMM(REG_64)                           0    1467
cmp REG_64,REG_64                           3391    2219
cmp IMM,REG_64                             12812    8944

  LLVM can better fold constants and addresses into the operands of cmp

test REG_8,REG_8                               0       7

(Excerpt truncated; the full histogram is in the linked results.txt.)

The histogram counts the occurrences of each instruction in ocamlopt. I have also included some comments to highlight which instruction choices are likely to be better or more compact. While one of the issues has since been addressed, the data does highlight a few other opportunities for improvement.
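For readers who want to build a similar table, here is a minimal sketch of the counting step. It assumes the instructions have already been normalised into shapes such as `cmp REG_64,REG_64`. The function names are hypothetical and this is not the script that produced results.txt.

```ocaml
(* Hypothetical sketch: count how often each normalised instruction shape
   occurs in two instruction streams, one per backend. *)
module SMap = Map.Make (String)

(* Increment the counter stored under [key]. *)
let bump key counts =
  SMap.update key (function None -> Some 1 | Some n -> Some (n + 1)) counts

(* Build a histogram from a list of instruction shapes. *)
let histogram shapes =
  List.fold_left (fun acc shape -> bump shape acc) SMap.empty shapes

(* Merge two histograms into rows of (shape, count_a, count_b),
   using 0 when a shape is missing from one side. *)
let compare_histograms a b =
  SMap.merge
    (fun _shape x y ->
      Some (Option.value x ~default:0, Option.value y ~default:0))
    a b
  |> SMap.bindings
  |> List.map (fun (shape, (x, y)) -> (shape, x, y))

(* Print the rows in the same column layout as the table above. *)
let print_rows rows =
  Printf.printf "%-40s %8s %8s\n" "INSTRUCTION" "OCAML" "LLIR";
  List.iter
    (fun (shape, x, y) -> Printf.printf "%-40s %8d %8d\n" shape x y)
    rows
```

Calling `print_rows (compare_histograms (histogram ocaml_shapes) (histogram llir_shapes))` on two lists of shapes prints one row per shape, sorted by shape name.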

bluddy · August 26, 2020, 9:59pm

Could you explain the LLVM path in a little more detail? How are you compiling via LLVM?

mseri · August 26, 2020, 10:00pm

There is a nice ICFP paper on this: https://dl.acm.org/doi/10.1145/3408980

nand · August 26, 2020, 10:04pm

We do not compile to LLVM IR, but to a representation we call LLIR that keeps track of the types of virtual registers in OCaml - int, float, value or address. The IR is then lowered to LLVM’s MachineIR through the SelectionDAG instruction selector. A custom instruction was added to handle GC metadata and some passes were modified to preserve the semantics of GC roots.
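As a rough illustration of why per-register types are useful, the sketch below models virtual registers tagged with the four types mentioned above. It is a hypothetical model, not the actual LLIR definitions. It assumes that only `Value` registers need to be reported to the GC, which is an assumption of the sketch and not something stated in this thread.

```ocaml
(* Hypothetical model of typed virtual registers; not the real LLIR types. *)
type reg_type =
  | Int      (* integer register *)
  | Float    (* floating-point register *)
  | Value    (* OCaml value *)
  | Address  (* address *)

type vreg = { id : int; ty : reg_type }

(* Assumption made for this sketch: only Value registers are GC roots. *)
let needs_gc_tracking = function
  | Value -> true
  | Int | Float | Address -> false

(* Select the registers that would be recorded as roots at a call site,
   given the registers live across that call. *)
let gc_roots live = List.filter (fun r -> needs_gc_tracking r.ty) live
```

With the types kept on registers, a pass can compute the root set at each call site by filtering the live registers, as `gc_roots` does here.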

lambda_foo · August 27, 2020, 2:56am

I missed the NY Q&A time for this paper.
Was there an answer to whether this will work with the Multicore OCaml work?
And how can someone try this out, is there an opam switch for it?

bluddy · August 27, 2020, 3:03am

They mention it in the paper. There’s nothing preventing multicore from working, but the code hasn’t been adapted to it.

lambda_foo · August 27, 2020, 3:04am

Thanks, I’ve not fully read the paper.

nand · August 27, 2020, 6:45am

There are instructions on setting up a pin here.
Unfortunately, some packages which rely on amd64 inline assembly or do not query the ocaml environment for the right C compiler to use (CC=llir-gcc) will not install at the moment.
We have the required diffs/versions in our version of the sandmark benchmark suite and we will create a repository out of them.

xavierleroy · August 27, 2020, 9:51am

Thanks for all the data and the suggestions. As you mention, a couple of them (32-bit immediate load into 64-bit register; no “call” instruction to set up trap handlers) are already implemented in OCaml 4.11.

The ocamlopt back-end was initially designed for RISC processors, with load instructions separate from computational instructions, so things like “callq (reg64)” just don’t fit.

anmolsahoo25 · September 15, 2020, 12:27pm

@nandor, I have a package which uses AVX intrinsics. While compiling, the error I get is -

Do not know how to split the result of this operator

With the invocation being -

clang-12 -cc1 -triple x86_64-unknown-linux-gnu -S -disable-free -disable-llvm-verifier -discard-value-names -main-file-name bigstringaf_simd_avx2.c -mrelocation-model pic -pic-level 2 -mframe-pointer=none -relaxed-aliasing -fmath-errno -fno-rounding-math -no-integrated-as -mconstructor-aliases -target-cpu x86-64 -target-feature +avx2 -tune-cpu generic -fno-split-dwarf-inlining -debug-info-kind=limited -dwarf-version=4 -debugger-tuning=gdb -v -resource-dir /home/anmol/.opam/llir/llvm/lib/clang/12.0.0 -D _FILE_OFFSET_BITS=64 -D _REENTRANT -I /home/anmol/.opam/llir/lib/ocaml -mllvm -llir -llir -isysroot /home/anmol/.opam/llir -internal-externc-isystem /home/anmol/.opam/llir/include -internal-isystem /home/anmol/.opam/llir/llvm/lib/clang/12.0.0/include -O2 -Wall -Wextra -Wpedantic -fno-dwarf-directory-asm -fdebug-compilation-dir /home/anmol/bigstringaf/_build/default/lib -ferror-limit 19 -fwrapv -fgnuc-version=4.2.1 -fcolor-diagnostics -vectorize-loops -vectorize-slp -o /tmp/bigstringaf_simd_avx2-ff6282.s -x c bigstringaf_simd_avx2.c

I am assuming this is because lowering for these intrinsics has not been implemented maybe? Any thoughts on this?

nand · September 15, 2020, 12:54pm

Unfortunately the C-to-LLIR lowering in Clang does not yet support SSE intrinsics.

anmolsahoo25 · September 16, 2020, 6:21pm

Oh okay. Makes sense. Thank you! Really enjoyed reading the paper and it's definitely pretty cool work.
