To tag integers (n becomes 2n+1), ocamlopt emits code like
leaq 1(%rax,%rax),%rax, which uses base + index + displacement.
However, for all Intel Core processors since 2011,
leaq 1(,%rax,2),%rax (scaled index + displacement, no base register) would be better.
- PRO: lower latency (1 cycle vs. 3 cycles)
- PRO: higher throughput (2 per cycle vs. 1 per cycle)
- CON: larger encoding (8 bytes vs. 5 bytes)
For AMD “Ryzen” processors, the change brings no speedup: both forms have the same latency and throughput.
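To make the comparison concrete, here is a minimal OCaml sketch. It models the two addressing forms at the source level and lists the x86-64 byte encodings of both instructions, assuming %rax as source and destination. The names tag_base_index, tag_scaled_index, enc_base_index and enc_scaled_index are hypothetical and used only for illustration. The sketch shows the arithmetic and the encoding sizes only; it does not show what ocamlopt emits for these functions.

```ocaml
(* OCaml represents the integer n as the machine word 2n + 1. *)

(* base + index + displacement, as in: leaq 1(%rax,%rax),%rax *)
let tag_base_index n = n + n + 1

(* scaled index + displacement, as in: leaq 1(,%rax,2),%rax *)
let tag_scaled_index n = (n lsl 1) + 1

(* Byte encodings of the two instructions (x86-64, %rax only).
   Without a base register the displacement must be 32 bits wide,
   which is where the three extra bytes come from. *)
let enc_base_index = [0x48; 0x8D; 0x44; 0x00; 0x01]
let enc_scaled_index = [0x48; 0x8D; 0x04; 0x45; 0x01; 0x00; 0x00; 0x00]

let () =
  (* Both forms yield the same result, including on wrap-around. *)
  List.iter
    (fun n -> assert (tag_base_index n = tag_scaled_index n))
    [0; 1; -1; 42; max_int; min_int];
  Printf.printf "tag 21 = %d\n" (tag_base_index 21);
  Printf.printf "encoding sizes: %d vs. %d bytes\n"
    (List.length enc_base_index) (List.length enc_scaled_index)
```

Run as a script, this prints "tag 21 = 43" and "encoding sizes: 5 vs. 8 bytes".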
Please consider adapting the code generator for Intel64 in
ocamlopt. Thank you!
PS. For details on the instruction timings, see https://www.agner.org/optimize/