OCaml speed comparison - calculating pi with Leibniz - optimize?

I have been nerd-sniped by this speed comparison repo: GitHub - niklas-heer/speed-comparison: A repo which compares the speed of different programming languages.

And implemented the following OCaml version of Leibniz’ algorithm to calculate pi (well, not so much implemented as ported over the Python version):

(* ocamlopt -O2 -o leibniz leibniz.ml && strip leibniz *)

let rec calc sum curr upto =
  if curr >= upto then sum
  else calc (sum +. 1. /. float_of_int curr) (curr + 4) upto

let calc rounds =
  let double = rounds * 2 in
  calc 0. (1 - double) (double + 1)

let () =
  let rounds_txt = open_in_bin "rounds.txt" in
  let rounds = int_of_string (input_line rounds_txt) in
  close_in rounds_txt;

  let pi = 4. *. calc rounds in
  Printf.printf "%.16f\n" pi

When I run this with time ./leibniz I consistently get:

$ time ./leibniz
3.1415926435899633
./leibniz  0.38s user 0.00s system 99% cpu 0.388 total

Am I missing anything obvious? The Go version runs in a fraction of that time:

time ./leibniz_go
3.141592663589326
./leibniz_go  0.13s user 0.01s system 95% cpu 0.142 total

Judging from the generated assembly, the OCaml compiler is not able to unbox the float (for the return value) across the recursive call. You can see the subq $16, %r15 cmpq (%r14), %r15 allocation sequence, as well as memory writes inside of the loop of camlExample__calc_5.

An imperative version does quite a bit better (1.235 vs 2.273 on my machine).

2 Likes

Probably the old boxing-of-floats issue.
EDIT: sniped

To make numeric code go faster, I suggest avoiding recursive calls (which inhibit float unboxing) and sticking to good old while loops, ie something like:

let calc rounds =
  let sum = ref 0. in
  let double = 2 * rounds in
  let curr = ref (1 - double) in
  let upto = double + 1 in
  while !curr < upto do
    sum := !sum +. 1. /. float_of_int !curr;
    curr := !curr + 4
  done;
  !sum

Cheers,
Nicolas

4 Likes

So now that we’ve brought this issue up again, is there any work being done to fix it? Will Flambda2 deal with it?

1 Like

Thanks all! That makes sense.

Moving float_of_int outside of the loop retires 200M fewer instructions according to perf stat but doesn’t actually speed things up on my end.

Yes, and in some cases yes.
The best way to fix it is to use Jane Street’s unboxed types (proposed as an RFC a while ago, with significant parts already implemented). Without that, Flambda 2 can still unbox the float values in the above example, but it’s based on heuristics so there’s no guarantee that it will keep working if small changes are made. Typically, if the function wasn’t tail-recursive it wouldn’t work.

4 Likes

Just to update here that with the recommended imperative approach the time on my laptop now matches the Go time, 0.13s. Just for fun, I’ll try submitting a PR to that repo.

3 Likes