So it seems there is no loophole. Function boundaries can only be crossed in a box.
Actually, there is a small trick you can use here. Do not use a generic reference, but use a custom type instead:
type t = { mutable f : float }

let calc2 rounds =
  let double = 2 * rounds in
  let upto = double + 1 in
  let sum = { f = 0. } in
  let rec loop curr =
    if curr >= upto
    then ()
    else begin
      sum.f <- sum.f +. 1. /. float_of_int curr;
      loop (curr + 4)
    end
  in
  loop (1 - double);
  sum.f
That way, the compiler will directly inline the floating-point value inside the structure, so the code will not incur any allocation (except the very first one) nor any write barrier.
Wow, this is quite non-obvious to me. I would have thought the compiler knows the same thing in either case, since the previous sum was also known to be a float ref. Apparently an 'a ref with 'a = float is not optimized the same way as a monomorphic custom type.
Would a Floatarray.t also work? I’m guessing a float array wouldn’t? Or does that depend on no-flat-float-array?
I’m wondering if this trick can be a way to allow breaking up numerical code into small functions, circumventing the float unboxing limitations. One would convert float arguments and return values into mutations, basically writing fortran routines.
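Something like this hypothetical sketch (the cell type and the helper names here are made up, just to illustrate the shape of the code):

```ocaml
(* A one-field float record: the field is stored unboxed. *)
type cell = { mutable f : float }

(* "Routines" read and write cells instead of passing floats around. *)
let square (x : cell) (out : cell) = out.f <- x.f *. x.f
let add (a : cell) (b : cell) (out : cell) = out.f <- a.f +. b.f

(* Fortran-style caller: all float traffic goes through mutations. *)
let sum_of_squares x y =
  let tx = { f = x } and ty = { f = y } and out = { f = 0. } in
  square tx tx;
  square ty ty;
  add tx ty out;
  out.f
```

Whether the compiler actually avoids all boxing here is exactly my question; this only shows what such code would look like.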
Yes, floatarray (and float array in flat mode) would also work just fine. But note that they imply some bounds checking. So, even if these checks are perfectly predicted by the processor, they still put some undue pressure on the decoding unit. You would have to use unsafe_get and unsafe_set to achieve the same performance as with a custom record.
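For instance, a minimal sketch of the same loop accumulating into a one-element floatarray (assuming the Float.Array module, available since OCaml 4.07):

```ocaml
(* Accumulate into a one-element floatarray; unsafe_get/unsafe_set
   skip the bounds checks mentioned above. *)
let calc3 rounds =
  let double = 2 * rounds in
  let upto = double + 1 in
  let sum = Float.Array.make 1 0. in
  let rec loop curr =
    if curr >= upto then ()
    else begin
      Float.Array.unsafe_set sum 0
        (Float.Array.unsafe_get sum 0 +. 1. /. float_of_int curr);
      loop (curr + 4)
    end
  in
  loop (1 - double);
  Float.Array.get sum 0
```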
Hello,
I cannot reproduce what the benchmark claims. I just checked gcc versus OCaml 4.14 and 5.4.
I used the C program from the GitHub repository (pi2.c), which outputs a different result, the ML program, and another C program more similar to the ML code (pi.c), and I get:
With gcc and ocaml 4.14.0
raffalli@oulala2:~/Programming/tmp$ gcc -O3 pi.c -o pi_c
raffalli@oulala2:~/Programming/tmp$ time ./pi_c
3.1415926525897171
real 0m0,861s
user 0m0,861s
sys 0m0,000s
raffalli@oulala2:~/Programming/tmp$ ocamlopt -O3 pi.ml -o pi_ml
raffalli@oulala2:~/Programming/tmp$ time ./pi_ml
3.1415926525897171
real 0m0,903s
user 0m0,902s
sys 0m0,000s
raffalli@oulala2:~/Programming/tmp$ gcc -O3 pi2.c -o pi2_c
raffalli@oulala2:~/Programming/tmp$ time ./pi2_c
3.1415926545880506
real 0m0,864s
user 0m0,864s
sys 0m0,000s
ocaml 5.4.0 is a bit faster.
raffalli@oulala2:~/Programming/tmp$ time ./pi_ml5
3.1415926525897171
real 0m0,863s
user 0m0,863s
sys 0m0,000s
Remark: I think references in OCaml are optimized into registers in that case (not the specialized structure proposed above).
Remark bis: the benchmark uses -O2 for OCaml, but it hardly makes a difference.
And if I try the same gcc options as the benchmark, I get the same results:
raffalli@oulala2:~/Programming/tmp$ gcc pi.c -o pi_opt -O3 -s -static -march=native -mtune=native -fomit-frame-pointer -fno-signed-zeros -fno-trapping-math -fassociative-math
raffalli@oulala2:~/Programming/tmp$ time ./pi_opt
3.1415926525897171
real 0m0,862s
user 0m0,862s
sys 0m0,000s
I am wondering whether it is a problem with OCaml running inside Docker?
You need to at least pass a -O option to gcc, otherwise performance is expected to be poor.
Argh, a space was missing. I fixed my post, thanks.
If you want correct rounding for pi (a program that does not output this should be rejected), you must sum the small values first, with code like this (even the parentheses are important):
let calc n =
  let sum = ref 0. in
  for i = n downto 0 do
    sum := !sum +. (1. /. float_of_int (4*i + 1)
                    -. 1. /. float_of_int (4*i + 3))
  done;
  !sum

(* rounds as in the earlier examples *)
let pi = 4. *. calc (rounds/2)
Gives 3.14159265258979347735
and not 3.141592652589717094 as the original does
(correct: 3.1415926535897932384626433…)
We are much closer to the correctly rounded IEEE 64-bit value.
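A quick sketch to see the summation-order effect (n is an arbitrary cutoff I picked):

```ocaml
let n = 1_000_000

(* small terms first (i = n downto 0), as recommended above *)
let backward =
  let sum = ref 0. in
  for i = n downto 0 do
    sum := !sum +. (1. /. float_of_int (4*i + 1)
                    -. 1. /. float_of_int (4*i + 3))
  done;
  4. *. !sum

(* large terms first: each small term is rounded against a big partial sum *)
let forward =
  let sum = ref 0. in
  for i = 0 to n do
    sum := !sum +. (1. /. float_of_int (4*i + 1)
                    -. 1. /. float_of_int (4*i + 3))
  done;
  4. *. !sum
```

Both stay within the series' truncation error of pi, but the last bits may differ; summing the small values first keeps the rounding error lower.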
I have looked at the famous benchmark game, and found that OCaml is 2.74 times slower than C if we reject SSE/MMX code and multithreaded code. But 2 of the entries use pragma optimisations. (I computed the exponential of the mean of the logarithms of the time ratios.)
The worst case seems to be k-nucleotide, where indexing an array can be very fast in C, and less so in OCaml (bounds checking…).
This comparison is limited… it would be best to convert the C code into OCaml so as to be sure to compare the same logic. (The C benchmarks have multiple submissions and OCaml usually only one… so less effort went into optimising it.)
I note a ratio of 3.9 between the C code I rejected and the C code I took into consideration against OCaml. In other words, the difference between two languages matters less than some very specific optimisation techniques (MMX/threads).
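"Exponential of the mean of the logarithms" is just the geometric mean; as a sketch, the computation I mean is:

```ocaml
(* Geometric mean of time ratios: exp of the mean of the logs. *)
let geometric_mean ratios =
  let n = float_of_int (List.length ratios) in
  exp (List.fold_left (fun acc r -> acc +. log r) 0. ratios /. n)
```

For example, geometric_mean [2.; 8.] comes out as 4 (up to rounding), so one 2x slowdown and one 8x slowdown average to 4x, not 5x.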
God I’m tempted to Solve All The Problems after reading this post, but I guess we should take it slowly.
Hmmm, it will degrade performance I think, but not that much on Linux?
This is also curious; you should probably tell the author.
We are ahead of Haskell and Python.
That’s very good news to me.
Python is not a good reference… it is one of the slowest.
There is an interesting benchmark here: Guillaume Marceau: The speed, size and dependability of programming languages. We can see the code-conciseness vs. performance balance. But I guess it used the game benchmark when it was at https://shootout.alioth.debian.org (so it can have the same biases as the previous link).
One interesting thing: I took one of the OCaml contributions, fannkuchredux (not the fastest, since the fastest uses process spawning), then converted it to C (with the help of Perplexity… it is quite convenient, but I had to compare both versions and fix two curious bugs in the translation).
Then:
$ time ./fannkuch_redux_c.exe 11
556355
Pfannkuchen(11) = 51
real 0m3.561s
user 0m0.000s
sys 0m0.078s
frede@LAPTOP-OHKJH7PV MINGW64 ~/bench
$ time ./fannkuch_redux_ml.exe 11
556355
Pfannkuchen(11) = 51
real 0m4.207s
user 0m0.031s
sys 0m0.000s
I find this comparison very good for OCaml. The C code is quite usual (idiomatic); I could have written the same manually. But note that it passes a function around and uses a recursive function. On the other hand, the code uses many array accesses, where C can be quite a bit faster (no bounds checking), so I am very surprised.
EDIT: With n_body… a heavy floating-point computation:
frede@LAPTOP-OHKJH7PV MINGW64 ~/bench
$ time ./n_body_ml.exe 100000000
-0.169075164
-0.169035465
real 0m10.632s
user 0m0.015s
sys 0m0.000s
frede@LAPTOP-OHKJH7PV MINGW64 ~/bench
$ time ./n_body_c.exe 100000000
-0.169075164
-0.169035465
real 0m7.640s
user 0m0.000s
sys 0m0.047s
I thought that float boxing would give C quite an advantage, but the difference is not that large.
However, with spectral_norm — many float arrays (created simply with Array.make) and a float ref:
$ time ./spectral_norm_ml.exe 5500
1.274224153
real 0m7.223s
user 0m0.031s
sys 0m0.015s
frede@LAPTOP-OHKJH7PV MINGW64 ~/bench
$ time ./spectral_norm_c.exe 5500
1.274224153
real 0m1.894s
user 0m0.031s
sys 0m0.031s
With fasta (converted by Claude… Perplexity was too buggy). DNA handling (floats and char arrays… very close results):
$ time ./fasta_ml.exe 25000000 |tail
gatctatgattcttaccagttgcaatattcaatttagcatgtgttcctaattattgtgtt
attatggtctatctcatcatgtaaatgaagatcatgacgtcaacacagattctagtcagg
atcatcagttcctcggggaaatcgcacctaggaacagccttatgcaaccgctaaacaaag
caatgaggatgtaccgacaaaagctcgatttaaaagcctcgaaacgagatgtacgaatcg
tttactgccttttatgaggagtcgagtactgttggttcatatttgctacatgattgtatg
taataacgatcccgccctttatcggttcgatcctttatggcgataagttatgaatcgtca
gtatctttagatcaaaaactcaactagtacccagttccccggaggaacggtcatgattaa
tgcgttttacggtctcccgtccctcttcttgtcagaggaatcagtttcatccgatcccac
tcgatgattggtatagctatttgccgaaaagccacaacgtattcggtactatcttgtttg
attcccctgtatcttaattc
real 0m5.602s
user 0m0.234s
sys 0m1.608s
frede@LAPTOP-OHKJH7PV MINGW64 ~/bench
$ time ./fasta2_c.exe 25000000 |tail
gatctatgattcttaccagttgcaatattcaatttagcatgtgttcctaattattgtgtt
attatggtctatctcatcatgtaaatgaagatcatgacgtcaacacagattctagtcagg
atcatcagttcctcggggaaatcgcacctaggaacagccttatgcaaccgctaaacaaag
caatgaggatgtaccgacaaaagctcgatttaaaagcctcgaaacgagatgtacgaatcg
tttactgccttttatgaggagtcgagtactgttggttcatatttgctacatgattgtatg
taataacgatcccgccctttatcggttcgatcctttatggcgataagttatgaatcgtca
gtatctttagatcaaaaactcaactagtacccagttccccggaggaacggtcatgattaa
tgcgttttacggtctcccgtccctcttcttgtcagaggaatcagtttcatccgatcccac
tcgatgattggtatagctatttgccgaaaagccacaacgtattcggtactatcttgtttg
attcccctgtatcttaattc
real 0m5.083s
user 0m0.390s
sys 0m1.623s
With binary-trees. Note that Claude optimised the code with an ad hoc pool allocator:
$ time ./binarytrees_c.exe 21
stretch tree of depth 22 check: 8388607
2097152 trees of depth 4 check: 65011712
524288 trees of depth 6 check: 66584576
131072 trees of depth 8 check: 66977792
32768 trees of depth 10 check: 67076096
8192 trees of depth 12 check: 67100672
2048 trees of depth 14 check: 67106816
512 trees of depth 16 check: 67108352
128 trees of depth 18 check: 67108736
32 trees of depth 20 check: 67108832
long lived tree of depth 21 check: 4194303
real 0m2.461s
user 0m0.000s
sys 0m0.047s
frede@LAPTOP-OHKJH7PV MINGW64 ~/bench
$ time ./binarytrees_ml.exe 21
stretch tree of depth 22 check: 8388607
2097152 trees of depth 4 check: 65011712
524288 trees of depth 6 check: 66584576
131072 trees of depth 8 check: 66977792
32768 trees of depth 10 check: 67076096
8192 trees of depth 12 check: 67100672
2048 trees of depth 14 check: 67106816
512 trees of depth 16 check: 67108352
128 trees of depth 18 check: 67108736
32 trees of depth 20 check: 67108832
long lived tree of depth 21 check: 4194303
real 0m9.143s
user 0m0.016s
sys 0m0.046s
If we replace the ad hoc pool allocator with plain malloc/free…
$ time ./binarytrees2_c.exe 21
stretch tree of depth 22 check: 8388607
2097152 trees of depth 4 check: 65011712
524288 trees of depth 6 check: 66584576
131072 trees of depth 8 check: 66977792
32768 trees of depth 10 check: 67076096
8192 trees of depth 12 check: 67100672
2048 trees of depth 14 check: 67106816
512 trees of depth 16 check: 67108352
128 trees of depth 18 check: 67108736
32 trees of depth 20 check: 67108832
long lived tree of depth 21 check: 4194303
real 0m35.190s
user 0m0.000s
sys 0m0.061s
A typical (idiomatic) C program can be 4 times slower than its OCaml equivalent!!!
Not yet compared: mandelbrot (the OCaml submission I have seen uses a fork, but there is a “normal” contribution) and reverse-complement, both on my todo list; pidigits (it would be a comparison of zarith vs. GMP more than of the compilers); regex-redux (it would be a comparison of the two regex libraries rather than of the compiler itself).
So I find OCaml very efficient… except in one case, where it could be interesting to analyse the OCaml program I linked to discover why. And the C malloc/free functions (and surely new/delete in C++) are very inefficient. We can avoid them when it is critical, but who does? (However, an ad hoc pool storage is more efficient than the OCaml allocator.)
Computing the geometric mean of the ratios… OCaml is 1.9 times slower than C (with the ad hoc allocator), and 1.1 times slower than C (with plain malloc/free).
(I used gcc with the -O3 option, and just plain ocamlopt.)
See the code here.