Understanding performance of a multi-core program

I’ve been playing with 5.0-beta1 and Domains. As an exercise I implemented a very simple worker pool in a few different ways:

  1. using multi-processing (MP): mp_queue.ml
  2. using Domains: domain_queue.ml
  3. using multi-threading in Rust: tq

I tried to translate (1) to (2) as directly as possible. The programs are not meant to be a good real-world benchmark, but rather something to help me build intuition about multicore.
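Roughly, the domain-based shape looks like this minimal sketch (not the actual domain_queue.ml; the worker count and the fib workload are illustrative):

```ocaml
(* Minimal sketch of the domain-based variant: each worker domain computes
   fib independently, and the main domain joins the results. A real pool
   would pull distinct tasks from a queue instead. *)
let rec fib n = if n <= 2 then 1 else fib (n - 1) + fib (n - 2)

let () =
  let num_workers = 4 in
  let workers =
    List.init num_workers (fun _ -> Domain.spawn (fun () -> fib 30))
  in
  let results = List.map Domain.join workers in
  List.iter (fun r -> Printf.printf "%d\n" r) results
```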

Anyhow, this is what I expected:

  1. tq to be the fastest,
  2. more or less parity between mp_queue and domain_queue.

The actual results were surprising and somewhat all over the place, but generally:

  1. the Rust version was ~2x faster than domains (which makes sense and acts as a baseline),
  2. the multi-process version was never slower than the domains-based one, and often faster (by up to 50%).

I’d appreciate any input that would clarify what’s going on here.


The following commands are run from the repo root (requires dune build @all --profile=release from the root and cargo build --release from the rust/ folder).

UPDATE 1: Don’t mind the Rust numbers – they need to be redone.
UPDATE 2: I reran the benchmarks with a fixed Rust implementation and with the -O3 flag for ocamlopt.

1. The first example matches my intuition: Rust is the fastest, with more-or-less parity between domains and MP.
WSL Linux, Intel i7-12700K:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/domain_queue.exe 40' 'rust/target/release/tq 40'
Benchmark #1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     687.9 ms ±  29.9 ms    [User: 1.918 s, System: 0.001 s]
  Range (min … max):   651.3 ms … 758.8 ms    20 runs

Benchmark #2: _build/default/domain_queue.exe 40
  Time (mean ± σ):     737.5 ms ±  64.4 ms    [User: 1.490 s, System: 0.006 s]
  Range (min … max):   575.3 ms … 818.5 ms    20 runs

Benchmark #3: rust/target/release/tq 40
  Time (mean ± σ):     328.3 ms ±  70.5 ms    [User: 845.9 ms, System: 5.9 ms]
  Range (min … max):   237.6 ms … 508.8 ms    20 runs

Summary
  'rust/target/release/tq 40' ran
    2.10 ± 0.46 times faster than '_build/default/mp_queue.exe 40'
    2.25 ± 0.52 times faster than '_build/default/domain_queue.exe 40'

2. The second example contradicts my intuition: Rust’s baseline is still the same, but the domains version is 50% slower than MP!
Server Linux, Intel Xeon:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/domain_queue.exe 40' 'rust/target/release/tq 40'
Benchmark 1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     788.7 ms ±  20.2 ms    [User: 2203.7 ms, System: 5.0 ms]
  Range (min … max):   767.7 ms … 838.7 ms    20 runs

Benchmark 2: _build/default/domain_queue.exe 40
  Time (mean ± σ):      1.192 s ±  0.087 s    [User: 2.312 s, System: 0.007 s]
  Range (min … max):    1.070 s …  1.447 s    20 runs

Benchmark 3: rust/target/release/tq 40
  Time (mean ± σ):     574.3 ms ±  87.3 ms    [User: 1387.5 ms, System: 1.0 ms]
  Range (min … max):   434.1 ms … 742.2 ms    20 runs

Summary
  'rust/target/release/tq 40' ran
    1.37 ± 0.21 times faster than '_build/default/mp_queue.exe 40'
    2.08 ± 0.35 times faster than '_build/default/domain_queue.exe 40'

3. The third example contradicts my intuition, but to a smaller degree: Rust is the fastest, but MP is 25% faster than Domains.
MacOS 12.6, 2.4 GHz 8-Core Intel Core i9:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/domain_queue.exe 40' 'rust/target/release/tq 40'
Benchmark 1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     797.4 ms ±  13.2 ms    [User: 2225.6 ms, System: 9.8 ms]
  Range (min … max):   772.6 ms … 832.5 ms    20 runs

Benchmark 2: _build/default/domain_queue.exe 40
  Time (mean ± σ):      1.005 s ±  0.086 s    [User: 1.892 s, System: 0.011 s]
  Range (min … max):    0.887 s …  1.174 s    20 runs

Benchmark 3: rust/target/release/tq 40
  Time (mean ± σ):     419.4 ms ±  61.1 ms    [User: 1146.7 ms, System: 2.2 ms]
  Range (min … max):   331.2 ms … 553.1 ms    20 runs

Summary
  'rust/target/release/tq 40' ran
    1.90 ± 0.28 times faster than '_build/default/mp_queue.exe 40'
    2.40 ± 0.40 times faster than '_build/default/domain_queue.exe 40'

4. The final example looks closer to example #2; however, here we have a different CPU architecture and a different OS.
MacOS 12.6, Apple M1 Pro:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/domain_queue.exe 40' 'rust/target/release/tq 40'
Benchmark 1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     610.7 ms ±   1.3 ms    [User: 1698.7 ms, System: 6.8 ms]
  Range (min … max):   607.9 ms … 612.7 ms    20 runs

Benchmark 2: _build/default/domain_queue.exe 40
  Time (mean ± σ):     874.2 ms ±  60.2 ms    [User: 1714.2 ms, System: 5.7 ms]
  Range (min … max):   776.0 ms … 1025.0 ms    20 runs

Benchmark 3: rust/target/release/tq 40
  Time (mean ± σ):     469.0 ms ±  79.5 ms    [User: 1318.1 ms, System: 3.0 ms]
  Range (min … max):   381.5 ms … 740.2 ms    20 runs

Summary
  'rust/target/release/tq 40' ran
    1.30 ± 0.22 times faster than '_build/default/mp_queue.exe 40'
    1.86 ± 0.34 times faster than '_build/default/domain_queue.exe 40'

I don’t have any other hardware/OS to test on. But so far, neither MacOS vs. Linux nor x86_64 vs. aarch64 explains the difference in behaviour.


Two comments:

  • At first glance, the Rust and OCaml versions of the fib function differ (n <= 2 vs n < 2). That could explain why Rust is slower than expected.
  • Your workers never communicate with each other after the initial setup, so (at least for OCaml; I don’t know much about Rust) you pay the cost of synchronisation without any benefit. A few experiments with perf show that the two versions execute roughly the same number of instructions, but the domain-based one spends much more time doing nothing (for example, more than ten times the number of context switches compared to the multi-process version). The main surprise is that this does not seem to have any impact on WSL.

@vlaviron

At first glance, the Rust and OCaml version of the fib functions differ (n <= 2 vs n < 2). That could explain why rust is slower than expected.

Haha, of course it had to be an off-by-one! Good spot! Fixing this makes the Rust version the fastest; so at least the baseline now makes sense (I’ll try to re-run the benchmarks sometime tomorrow).

so (at least for OCaml, I don’t know much about Rust) you pay the cost of synchronisation without any benefit

What sort of synchronization do you have in mind? Each worker thread communicates with the main thread via its own unbounded channel, and sending to or receiving from a channel should be (I’d expect) a couple of compare-and-exchange operations. Code-wise, the synchronization points are not obvious to me.
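To make the synchronisation points concrete, here is a minimal sketch of an unbounded blocking channel built from a Mutex and a Condition (the channel in the actual benchmark may well be implemented differently, e.g. lock-free):

```ocaml
(* A minimal unbounded channel: senders never block; receivers block on a
   condition variable until an item is available. *)
type 'a chan = {
  q : 'a Queue.t;
  m : Mutex.t;
  c : Condition.t;
}

let make () =
  { q = Queue.create (); m = Mutex.create (); c = Condition.create () }

let send ch v =
  Mutex.lock ch.m;
  Queue.push v ch.q;
  Condition.signal ch.c;
  Mutex.unlock ch.m

let recv ch =
  Mutex.lock ch.m;
  while Queue.is_empty ch.q do Condition.wait ch.c ch.m done;
  let v = Queue.pop ch.q in
  Mutex.unlock ch.m;
  v
```

Even in this lock-based form, the only obvious synchronisation is per send/recv pair, which happens once per task rather than once per fib call.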

I believe that the multicore GC needs synchronisation between all the domains for a variety of events, for instance minor GC. I suspect that this (or something similar) could end up stalling your program regularly.

But I don’t have enough knowledge to give a more precise answer; in particular I don’t know if you’re hitting a bug in the multicore GC (stalling much more than necessary) or if this is expected.

I have a very simple method to assess if something is parallelized well:

  • fire up gkrellm (or htop) and check that all cores are busy

Load distribution looks pretty similar between MP and Domain versions:
(screenshot: per-core load, MP vs. Domain versions)

I’m a bit surprised that the Domain version has 2x the threads. Not sure if it’s the runtime or an artifact of htop.

Each domain has a backup thread to service GC interrupts while the main thread is in a blocking system call.


@gadmm thanks for the insight! To clarify, you’re saying that if I spawn 100 domains, the runtime will spawn 100 more threads to service GC that will be there for the whole duration of the program’s execution?

That would explain some extra context switches @vlaviron saw, but probably not the 10x difference…?

There will still be inter-thread synchronization even if the program is not taking advantage of the shared global heap. I could be wrong, but at a high level I would not expect performance gains from rewriting a multiprocessing program to use domains unless it also exploits the shared heap, for instance by replacing inter-process communication via marshalled values, a shared file system, or a shared database with simply passing a pointer.

I haven’t looked at any IR/asm dumps, but I’d guess that your fib function needs a polling point on each iteration. Since it has a very short body, it would not be surprising to me if that dominated. It would also be more informative to use a function that allocates instead of fib.
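For instance, an allocating variant could look like this sketch (fib_boxed is a hypothetical name, not anything from the benchmark): boxing the intermediate results means every call allocates on the minor heap, so each domain would reach GC synchronisation points far more often than with the unboxed fib.

```ocaml
(* Hedged sketch of an allocating fib: each call returns a heap-allocated
   ref cell, so the minor allocator (and its safepoints) is hit on every
   recursive step. *)
let rec fib_boxed n =
  if n <= 2 then ref 1
  else ref (!(fib_boxed (n - 1)) + !(fib_boxed (n - 2)))

let fib_alloc n = !(fib_boxed n)
```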


Good point @jjb!

I am not confident that performance gains should be expected if a multiprocessing program is rewritten to use domains without also taking advantage of the shared heap

This makes sense. I didn’t expect gains, and was expecting more or less parity; what was surprising to me is to see up to 50% slow down.

I haven’t looked at any IR/asm dumps, but guess that your fib function needs a polling point on each iteration. Since it has a very short body, it would not be surprising to me if that dominated

Do you mean that it might just be a pathological case for multicore?

I looked at the assembly of fib. The only thing that stands out is a call to _caml_call_realloc_stack, which calls _caml_try_realloc_stack, but I don’t see GC polls there. With that said, I’m not familiar with the runtime internals, so I might be missing something.


0000000100004f10 <_camlDune__exe__Domain_queue__fib_62>:
100004f10: 4c 8d 94 24 a8 fe ff ff      leaq    -344(%rsp), %r10
100004f18: 4d 3b 56 20                  cmpq    32(%r14), %r10
100004f1c: 72 45                        jb      0x100004f63 <_camlDune__exe__Domain_queue__fib_62+0x53>
100004f1e: 48 83 ec 18                  subq    $24, %rsp
100004f22: 48 83 f8 05                  cmpq    $5, %rax
100004f26: 7f 0c                        jg      0x100004f34 <_camlDune__exe__Domain_queue__fib_62+0x24>
100004f28: b8 03 00 00 00               movl    $3, %eax
100004f2d: 48 83 c4 18                  addq    $24, %rsp
100004f31: c3                           retq
100004f32: 66 90                        nop
100004f34: 48 89 04 24                  movq    %rax, (%rsp)
100004f38: 48 83 c0 fc                  addq    $-4, %rax
100004f3c: e8 cf ff ff ff               callq   0x100004f10 <_camlDune__exe__Domain_queue__fib_62>
100004f41: 48 89 44 24 08               movq    %rax, 8(%rsp)
100004f46: 48 8b 04 24                  movq    (%rsp), %rax
100004f4a: 48 83 c0 fe                  addq    $-2, %rax
100004f4e: e8 bd ff ff ff               callq   0x100004f10 <_camlDune__exe__Domain_queue__fib_62>
100004f53: 48 8b 5c 24 08               movq    8(%rsp), %rbx
100004f58: 48 01 d8                     addq    %rbx, %rax
100004f5b: 48 ff c8                     decq    %rax
100004f5e: 48 83 c4 18                  addq    $24, %rsp
100004f62: c3                           retq
100004f63: 6a 24                        pushq   $36
100004f65: e8 ba 06 07 00               callq   0x100075624 <_caml_call_realloc_stack>
100004f6a: 41 5a                        popq    %r10
100004f6c: eb b0                        jmp     0x100004f1e <_camlDune__exe__Domain_queue__fib_62+0xe>
100004f6e: 66 90                        nop

That might be the issue, actually. If a thread wants to start a minor collection, it has to wait until all other threads reach a poll point, and your fib function only reaches one when reallocating the stack.
Normally recursive functions must contain a poll point, but it’s possible that for non-tail recursion we omit the poll because we assume that unbounded recursion would quickly reach the stack limit and trigger reallocation. Your case shows that this is dangerous, because within a fixed stack depth you can still fit an arbitrary number of recursive calls, as long as you have a non-linear call graph.

Could you try replacing fib with the following implementation and report the results? Making one of the calls tail-recursive should force the compiler to insert a poll, ensuring more timely synchronisation.

let rec fib_aux acc n =
  if n < 2 then acc + 1
  else fib_aux (fib_aux acc (n-2)) (n-1)
let fib n = fib_aux 0 n

Then this is “There are not enough polls during non-tail recursive calls” (Issue #11580 · ocaml/ocaml · GitHub).


@vlaviron I can see a GC poll point added with a tail-rec fib impl:

0000000100004ee0 <_camlDune__exe__Domain_queue_tr__fib_aux_62>:
...
100004f1f: e8 9c 09 07 00               callq   0x1000758c0 <_caml_call_gc>
...

I don’t have access to all the machines right now, but preliminary results show that switching to the tail-rec fib helped the domains-based version (roughly a 20%–25% improvement) but slowed down the MP version. Below, _tr marks binaries with the tail-rec fib:

MacOS 12.6, Intel Core i9:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/mp_queue_tr.exe 40' '_build/default/domain_queue.exe 40' '_build/default/domain_queue_tr.exe 40'
Benchmark 1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     853.4 ms ±  10.7 ms    [User: 2373.7 ms, System: 13.5 ms]
  Range (min … max):   840.8 ms … 880.5 ms    20 runs

Benchmark 2: _build/default/mp_queue_tr.exe 40
  Time (mean ± σ):      1.016 s ±  0.017 s    [User: 2.826 s, System: 0.016 s]
  Range (min … max):    0.999 s …  1.062 s    20 runs

Benchmark 3: _build/default/domain_queue.exe 40
  Time (mean ± σ):      1.064 s ±  0.106 s    [User: 2.016 s, System: 0.014 s]
  Range (min … max):    0.935 s …  1.319 s    20 runs

Benchmark 4: _build/default/domain_queue_tr.exe 40
  Time (mean ± σ):     871.0 ms ±  20.1 ms    [User: 2414.3 ms, System: 11.7 ms]
  Range (min … max):   847.5 ms … 925.0 ms    20 runs

Summary
  '_build/default/mp_queue.exe 40' ran
    1.02 ± 0.03 times faster than '_build/default/domain_queue_tr.exe 40'
    1.19 ± 0.03 times faster than '_build/default/mp_queue_tr.exe 40'
    1.25 ± 0.12 times faster than '_build/default/domain_queue.exe 40'

Server Linux, Intel Xeon:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/mp_queue_tr.exe 40' '_build/default/domain_queue.exe 40' '_build/default/domain_queue_tr.exe 40'
Benchmark 1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     835.5 ms ±  17.2 ms    [User: 2336.6 ms, System: 6.1 ms]
  Range (min … max):   807.5 ms … 871.4 ms    20 runs

Benchmark 2: _build/default/mp_queue_tr.exe 40
  Time (mean ± σ):      1.188 s ±  0.032 s    [User: 3.327 s, System: 0.005 s]
  Range (min … max):    1.158 s …  1.296 s    20 runs

Benchmark 3: _build/default/domain_queue.exe 40
  Time (mean ± σ):      1.246 s ±  0.084 s    [User: 2.401 s, System: 0.006 s]
  Range (min … max):    1.124 s …  1.466 s    20 runs

Benchmark 4: _build/default/domain_queue_tr.exe 40
  Time (mean ± σ):      1.048 s ±  0.020 s    [User: 2.933 s, System: 0.004 s]
  Range (min … max):    0.994 s …  1.092 s    20 runs

Summary
  '_build/default/mp_queue.exe 40' ran
    1.25 ± 0.04 times faster than '_build/default/domain_queue_tr.exe 40'
    1.42 ± 0.05 times faster than '_build/default/mp_queue_tr.exe 40'
    1.49 ± 0.11 times faster than '_build/default/domain_queue.exe 40'

Ran on the two remaining machines. Switching to the tail-recursive version seems to have helped on the Intel machine, but on Apple Silicon it made things quite a bit slower.

WSL Linux, Intel i7-12700K:

hyperfine '_build/default/mp_queue.exe 40' '_build/default/mp_queue_tr.exe 40' '_build/default/domain_queue.exe 40' '_build/default/domain_queue_tr.exe 40'
Benchmark #1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     677.3 ms ±  15.6 ms    [User: 1.906 s, System: 0.002 s]
  Range (min … max):   656.1 ms … 706.2 ms    10 runs

Benchmark #2: _build/default/mp_queue_tr.exe 40
  Time (mean ± σ):     647.6 ms ±  17.7 ms    [User: 1.829 s, System: 0.002 s]
  Range (min … max):   616.6 ms … 677.7 ms    10 runs

Benchmark #3: _build/default/domain_queue.exe 40
  Time (mean ± σ):     744.2 ms ±  63.4 ms    [User: 1.492 s, System: 0.002 s]
  Range (min … max):   602.2 ms … 824.2 ms    10 runs

Benchmark #4: _build/default/domain_queue_tr.exe 40
  Time (mean ± σ):     621.6 ms ±  18.9 ms    [User: 1.728 s, System: 0.000 s]
  Range (min … max):   599.1 ms … 655.0 ms    10 runs

Summary
  '_build/default/domain_queue_tr.exe 40' ran
    1.04 ± 0.04 times faster than '_build/default/mp_queue_tr.exe 40'
    1.09 ± 0.04 times faster than '_build/default/mp_queue.exe 40'
    1.20 ± 0.11 times faster than '_build/default/domain_queue.exe 40'

MacOS 12.6, Apple M1 Pro:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/mp_queue_tr.exe 40' '_build/default/domain_queue.exe 40' '_build/default/domain_queue_tr.exe 40'
Benchmark 1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     612.9 ms ±   0.9 ms    [User: 1699.4 ms, System: 8.9 ms]
  Range (min … max):   611.2 ms … 614.5 ms    20 runs

Benchmark 2: _build/default/mp_queue_tr.exe 40
  Time (mean ± σ):     900.3 ms ±   2.0 ms    [User: 2503.6 ms, System: 9.5 ms]
  Range (min … max):   896.7 ms … 904.2 ms    20 runs

Benchmark 3: _build/default/domain_queue.exe 40
  Time (mean ± σ):     896.6 ms ±  85.7 ms    [User: 1719.2 ms, System: 7.8 ms]
  Range (min … max):   802.6 ms … 1105.5 ms    20 runs

Benchmark 4: _build/default/domain_queue_tr.exe 40
  Time (mean ± σ):      1.041 s ±  0.002 s    [User: 2.900 s, System: 0.008 s]
  Range (min … max):    1.038 s …  1.043 s    20 runs

Summary
  '_build/default/mp_queue.exe 40' ran
    1.46 ± 0.14 times faster than '_build/default/domain_queue.exe 40'
    1.47 ± 0.00 times faster than '_build/default/mp_queue_tr.exe 40'
    1.70 ± 0.00 times faster than '_build/default/domain_queue_tr.exe 40'