Eio/multicore puzzling performance

I was playing around with eio, specifically its worker-pool example.

I wrote various trivial worker functions to burn some CPU, just so I could see that it
genuinely was using multiple cores.

The intrec function below behaved as you would expect, but the floatrec one
actually got slower with more cores. Using two cores was roughly twice as
slow (elapsed time) as using one core, which equals four times the CPU used…

let intrec () =
  let rec single_sum acc b =
    if b <= 0 then acc
    else
      let new_acc = acc + b in
      let new_b = b - 1 in
      (single_sum [@tailcall]) new_acc new_b
  in
  let rec double_sum acc a b =
    if a <= 0 then acc
    else
      let new_acc = acc + single_sum 0 b in
      let new_a = a - 1 in
      (double_sum [@tailcall]) new_acc new_a b
  in
  double_sum 0 scale scale

let floatrec () =
  let rec single_sum acc b =
    if b <= 0.0 then acc
    else
      let new_acc = acc +. b in
      let new_b = b -. 1.0 in
      (single_sum [@tailcall]) new_acc new_b
  in
  let rec double_sum acc a b =
    if a <= 0.0 then acc
    else
      let new_acc = acc +. single_sum 0.0 b in
      let new_a = a -. 1.0 in
      (double_sum [@tailcall]) new_acc new_a b
  in
  let floatscale = float_of_int scale in
  double_sum 0.0 floatscale floatscale |> int_of_float

I initially assumed I had written horribly inefficient code that was either
forcing the garbage collector to run constantly or ripping through the stack. I
checked with perf, though, and tagged the functions with [@tailcall], and am
still none the wiser.

The full program and perf fragments are pasted here: 63a621e — paste.sr.ht

I’m hoping it will be something stupid I’ve done.

Oh - OCaml 5.1, eio 0.12, Debian 12 on a Raspberry Pi 4.

TIA


It might be float boxing. See OCaml speed comparison - calculating pi with Leibniz - optimize? - #2 by copy

Maybe try batching elements per core with big arrays to avoid the boxed floats overhead?
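A minimal sketch of that batching idea, using the stdlib floatarray type rather than a bigarray for brevity (the scale value here is an arbitrary stand-in, not the one from the original program). A floatarray stores its elements unboxed, unlike a float list or a polymorphic container:

```ocaml
(* Sketch only: Float.Array (floatarray) stores its elements flat and
   unboxed, so building and summing it avoids one boxed float per
   element.  "scale" is a stand-in value. *)
let scale = 1000

let floatarr_sum () =
  (* fill with 1.0 .. float scale *)
  let a = Float.Array.init scale (fun i -> float_of_int (i + 1)) in
  Float.Array.fold_left ( +. ) 0.0 a
```

Whether this actually helps depends on where the boxing happens; the fold's accumulator can still be boxed in bytecode, so it's only a sketch of the storage side.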

I was expecting the floating point run to be slower. The thing I can’t understand is that the timings are something like:

  • intrec, 1 core = 1 second elapsed, 1 second user CPU
  • intrec, 2 cores = 0.5 seconds elapsed, 1 second user CPU
  • floatrec, 1 core = 5 seconds elapsed, 5 seconds user CPU
  • floatrec, 2 cores = 10 seconds elapsed, 20 seconds user CPU

So adding a second core effectively halves the speed (and 3 cores takes 30 seconds, etc.). There must be some interaction between the workers on each core, but I can’t see what.
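One way to test the boxing suggestion from the earlier reply: Gc.allocated_bytes reports the total bytes the program has allocated so far, so differencing it around a call shows roughly how much a function allocates. This is only a sketch; sum_boxed here is a stand-in for the floatrec inner loop, and the numbers will differ between bytecode and native code:

```ocaml
(* Stand-in for the floatrec inner loop. *)
let rec sum_boxed acc b =
  if b <= 0.0 then acc else sum_boxed (acc +. b) (b -. 1.0)

(* Bytes allocated while running f once, measured via the GC counters. *)
let allocated_by f =
  let before = Gc.allocated_bytes () in
  ignore (f ());
  Gc.allocated_bytes () -. before

let () =
  Printf.printf "allocated: %.0f bytes\n"
    (allocated_by (fun () -> sum_boxed 0.0 1000.0))
```

If the float version allocates heavily and the int version doesn't, that would at least confirm the boxing part of the story, even if it doesn't explain the cross-core slowdown by itself.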

Aha! It’s nothing to do with inter-thread scheduling or anything like that:

master »~/dev/ocaml/eio_tests $ ./_build/default/bin/multicore_6.exe d 1 d 1
+Worker 1 : 3429680500000
+Elapsed time = 13.1 s

master »~/dev/ocaml/eio_tests $ parallel -j 2 -n 2 ./_build/default/bin/multicore_6.exe -- d 1 d 1
+Worker 1 : 3429680500000
+Elapsed time = 24.4 s
+Worker 1 : 3429680500000
+Elapsed time = 25.7 s

It’s presumably something in the Raspberry Pi itself (or perhaps a low-level library?). Maybe the CPU on the Pi 4 has a single FPU and we’re contending for access to it.
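For what it’s worth, the same comparison can be made without eio or GNU parallel by spawning domains directly. A sketch, assuming OCaml ≥ 5 and the unix library for wall-clock timing; the worker below is a trivial stand-in for intrec/floatrec:

```ocaml
(* Run the same worker in n domains and return the elapsed wall-clock
   time.  Link with the unix library for Unix.gettimeofday. *)
let time_parallel n worker =
  let t0 = Unix.gettimeofday () in
  let domains = List.init n (fun _ -> Domain.spawn worker) in
  List.iter (fun d -> ignore (Domain.join d)) domains;
  Unix.gettimeofday () -. t0

let () =
  (* Tiny stand-in worker; swap in intrec/floatrec from the post. *)
  let worker () =
    Sys.opaque_identity (List.fold_left ( + ) 0 (List.init 1_000_000 Fun.id))
  in
  Printf.printf "1 domain: %.3fs  2 domains: %.3fs\n"
    (time_parallel 1 worker) (time_parallel 2 worker)
```

If two domains in one process slow down but two separate processes don’t, that would point at the runtime; if both slow down equally (as the parallel run above suggests), it points below OCaml.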


The Pi 4 should have a Cortex-A72(?), whose spec says it has one FPU per core, in 64-bit mode at least.

Check /proc/cpuinfo and wikipedia.

I can’t reproduce this on an Intel Core i7 (4 physical cores), so I’d wager on a hardware peculiarity of the Pi.
Perf results: https://paste.osau.re/Qel#eyJhbGciOiJBMTI4Q0JDIiwiZXh0Ijp0cnVlLCJrIjoiMHMyUFI3WnZvRkxuTHBGVl9IQ2hXQSIsImtleV9vcHMiOlsiZW5jcnlwdCIsImRlY3J5cHQiXSwia3R5Ijoib2N0In0=

Edit: the fact that it doesn’t happen with parallel points at something OCaml-specific, though. Maybe a bug in the ARM runtime/generated code?

Ah, but it did happen with “parallel”, which doesn’t rule out OCaml, but does mean it’s nothing to do with the domain/eio threading stuff.

I shall (a) try it out on an intel laptop later and (b) see if I can get another language to show a similar slowdown. I suspect I’ll see the same results as you on intel (thanks for testing it by the way).

Ah but it did happen with “parallel”

Ah, yes, I read too fast.