Eio/multicore puzzling performance

I was playing around with eio, specifically its worker-pool example.

I wrote various trivial worker functions to burn some CPU, just so I could see that it
genuinely was using multiple cores.

The intrec function below behaved as you would expect, but the floatrec one
actually got slower with more cores. Using two cores was roughly twice as
slow (elapsed time) as using one core, which equals four times the CPU used…

let intrec () =
  let rec single_sum acc b =
    if b <= 0 then acc
    else
      let new_acc = acc + b in
      let new_b = b - 1 in
      (single_sum [@tailcall]) new_acc new_b
  in
  let rec double_sum acc a b =
    if a <= 0 then acc
    else
      let new_acc = acc + single_sum 0 b in
      let new_a = a - 1 in
      (double_sum [@tailcall]) new_acc new_a b
  in
  double_sum 0 scale scale

let floatrec () =
  let rec single_sum acc b =
    if b <= 0.0 then acc
    else
      let new_acc = acc +. b in
      let new_b = b -. 1.0 in
      (single_sum [@tailcall]) new_acc new_b
  in
  let rec double_sum acc a b =
    if a <= 0.0 then acc
    else
      let new_acc = acc +. single_sum 0.0 b in
      let new_a = a -. 1.0 in
      (double_sum [@tailcall]) new_acc new_a b
  in
  let floatscale = float_of_int scale in
  double_sum 0.0 floatscale floatscale |> int_of_float

I initially assumed I had written horribly inefficient code that was either
forcing the garbage collector to run constantly or ripping through the stack. I
checked with perf, though, and tagged the functions with [@tailcall], and am
still none the wiser.

The full program and perf fragments are pasted here: 63a621e — paste.sr.ht

I’m hoping it will be something stupid I’ve done.

Oh - OCaml 5.1, eio 0.12, Debian 12 on a Raspberry Pi 4.

TIA


It might be float boxing. See OCaml speed comparison - calculating pi with Leibniz - optimize? - #2 by copy

Maybe try batching elements per core with big arrays to avoid the boxed floats overhead?
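A minimal sketch of that batching idea, using the stdlib floatarray type rather than a bigarray for brevity (the scale value here is an arbitrary stand-in, not the one from the original program). A floatarray stores its elements unboxed, unlike a float list or a polymorphic container:

```ocaml
(* Sketch only: Float.Array (floatarray) stores its elements flat and
   unboxed, so building and summing it avoids one boxed float per
   element.  "scale" is a stand-in value. *)
let scale = 1000

let floatarr_sum () =
  (* fill with 1.0 .. float scale *)
  let a = Float.Array.init scale (fun i -> float_of_int (i + 1)) in
  Float.Array.fold_left ( +. ) 0.0 a
```

Whether this actually helps depends on where the boxing happens; the fold's accumulator can still be boxed in bytecode, so it's only a sketch of the storage side.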

I was expecting the floating point run to be slower. The thing I can’t understand is that the timings are something like:

  • intrec, 1 core = 1 second elapsed, 1 second user CPU
  • intrec, 2 cores = 0.5 seconds elapsed, 1 second user CPU
  • floatrec, 1 core = 5 seconds elapsed, 5 seconds user CPU
  • floatrec, 2 cores = 10 seconds elapsed, 20 seconds user CPU

So adding a second core effectively halves the speed (and 3 cores takes 30 seconds, etc.). There must be some interaction between the workers on each core, but I can’t see what.
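One way to test the boxing suggestion from the earlier reply: Gc.allocated_bytes reports the total bytes the program has allocated so far, so differencing it around a call shows roughly how much a function allocates. This is only a sketch; sum_boxed here is a stand-in for the floatrec inner loop, and the numbers will differ between bytecode and native code:

```ocaml
(* Stand-in for the floatrec inner loop. *)
let rec sum_boxed acc b =
  if b <= 0.0 then acc else sum_boxed (acc +. b) (b -. 1.0)

(* Bytes allocated while running f once, measured via the GC counters. *)
let allocated_by f =
  let before = Gc.allocated_bytes () in
  ignore (f ());
  Gc.allocated_bytes () -. before

let () =
  Printf.printf "allocated: %.0f bytes\n"
    (allocated_by (fun () -> sum_boxed 0.0 1000.0))
```

If the float version allocates heavily and the int version doesn't, that would at least confirm the boxing part of the story, even if it doesn't explain the cross-core slowdown by itself.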

Aha! It’s nothing to do with inter-thread scheduling or anything like that:

master »~/dev/ocaml/eio_tests $ ./_build/default/bin/multicore_6.exe d 1 d 1
+Worker 1 : 3429680500000
+Elapsed time = 13.1 s

master »~/dev/ocaml/eio_tests $ parallel -j 2 -n 2 ./_build/default/bin/multicore_6.exe -- d 1 d 1
+Worker 1 : 3429680500000
+Elapsed time = 24.4 s
+Worker 1 : 3429680500000
+Elapsed time = 25.7 s

It’s presumably something in the Raspberry Pi itself (or perhaps a low-level library?). Maybe the CPU on the Pi 4 has a single FPU and we’re contending for access to it.
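For what it’s worth, the same comparison can be made without eio or GNU parallel by spawning domains directly. A sketch, assuming OCaml ≥ 5 and the unix library for wall-clock timing; the worker below is a trivial stand-in for intrec/floatrec:

```ocaml
(* Run the same worker in n domains and return the elapsed wall-clock
   time.  Link with the unix library for Unix.gettimeofday. *)
let time_parallel n worker =
  let t0 = Unix.gettimeofday () in
  let domains = List.init n (fun _ -> Domain.spawn worker) in
  List.iter (fun d -> ignore (Domain.join d)) domains;
  Unix.gettimeofday () -. t0

let () =
  (* Tiny stand-in worker; swap in intrec/floatrec from the post. *)
  let worker () =
    Sys.opaque_identity (List.fold_left ( + ) 0 (List.init 1_000_000 Fun.id))
  in
  Printf.printf "1 domain: %.3fs  2 domains: %.3fs\n"
    (time_parallel 1 worker) (time_parallel 2 worker)
```

If two domains in one process slow down but two separate processes don’t, that would point at the runtime; if both slow down equally (as the parallel run above suggests), it points below OCaml.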


The Pi 4 should have a Cortex-A72(?), whose spec says it has one FPU per core, in 64-bit mode at least.

Check /proc/cpuinfo and wikipedia.

I can’t reproduce this on an Intel Core i7 (4 physical cores), so I’d wager on a hardware peculiarity of the Pi.
Perf results: https://paste.osau.re/Qel#eyJhbGciOiJBMTI4Q0JDIiwiZXh0Ijp0cnVlLCJrIjoiMHMyUFI3WnZvRkxuTHBGVl9IQ2hXQSIsImtleV9vcHMiOlsiZW5jcnlwdCIsImRlY3J5cHQiXSwia3R5Ijoib2N0In0=

Edit: the fact that it doesn’t happen with parallel points at something OCaml-specific, though. Maybe a bug in the ARM runtime/generated code?

Ah, but it did happen with “parallel”, which doesn’t rule out OCaml, but does mean it’s nothing to do with the domain/eio threading stuff.

I shall (a) try it out on an intel laptop later and (b) see if I can get another language to show a similar slowdown. I suspect I’ll see the same results as you on intel (thanks for testing it by the way).

Ah but it did happen with “parallel”

Ah, yes, I read too fast.