Understanding performance of a multi-core program

I’ve been playing with 5.0-beta1 and Domains. As an exercise I implemented a very simple worker pool in a few different ways:

  1. using multi-processing (MP): mp_queue.ml
  2. using Domains: domain_queue.ml
  3. using multi-threading in Rust: tq

I tried to translate (1) to (2) as directly as possible. The programs are not meant to be a good real-world benchmark, but rather something to help me build intuition about multicore.
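Roughly, the domain-based shape looks like this minimal sketch (not the actual domain_queue.ml; the worker count and the fib workload are illustrative):

```ocaml
(* Minimal sketch of the domain-based variant: each worker domain computes
   fib independently, and the main domain joins the results. A real pool
   would pull distinct tasks from a queue instead. *)
let rec fib n = if n <= 2 then 1 else fib (n - 1) + fib (n - 2)

let () =
  let num_workers = 4 in
  let workers =
    List.init num_workers (fun _ -> Domain.spawn (fun () -> fib 30))
  in
  let results = List.map Domain.join workers in
  List.iter (fun r -> Printf.printf "%d\n" r) results
```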

Anyhow, this is what I expected:

  1. tq to be the fastest,
  2. more or less parity between mp_queue and domain_queue.

The actual results were surprising and somewhat all over the place, but generally:

  1. the Rust version was ~2x faster than domains (which makes sense and acts as a baseline),
  2. the multi-process version was never slower than the domains-based one, and often faster (by up to 50%).

I’d appreciate any input that would clarify what’s going on here.


The following commands are run from the repo root (requires dune build @all --profile=release from the root and cargo build --release from the rust/ folder).

UPDATE 1: Don’t mind the Rust numbers – they need to be redone.
UPDATE 2: I reran the benchmarks with a fixed Rust implementation and with the -O3 flag for ocamlopt.

1. The first example matches my intuition: Rust is the fastest, with more-or-less parity between domains and MP.
WSL Linux, Intel i7-12700K:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/domain_queue.exe 40' 'rust/target/release/tq 40'
Benchmark #1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     687.9 ms ±  29.9 ms    [User: 1.918 s, System: 0.001 s]
  Range (min … max):   651.3 ms … 758.8 ms    20 runs

Benchmark #2: _build/default/domain_queue.exe 40
  Time (mean ± σ):     737.5 ms ±  64.4 ms    [User: 1.490 s, System: 0.006 s]
  Range (min … max):   575.3 ms … 818.5 ms    20 runs

Benchmark #3: rust/target/release/tq 40
  Time (mean ± σ):     328.3 ms ±  70.5 ms    [User: 845.9 ms, System: 5.9 ms]
  Range (min … max):   237.6 ms … 508.8 ms    20 runs

Summary
  'rust/target/release/tq 40' ran
    2.10 ± 0.46 times faster than '_build/default/mp_queue.exe 40'
    2.25 ± 0.52 times faster than '_build/default/domain_queue.exe 40'

2. The second example contradicts my intuition: Rust’s baseline is still the same, but the domains version is 50% slower than MP!
Server Linux, Intel Xeon:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/domain_queue.exe 40' 'rust/target/release/tq 40'
Benchmark 1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     788.7 ms ±  20.2 ms    [User: 2203.7 ms, System: 5.0 ms]
  Range (min … max):   767.7 ms … 838.7 ms    20 runs

Benchmark 2: _build/default/domain_queue.exe 40
  Time (mean ± σ):      1.192 s ±  0.087 s    [User: 2.312 s, System: 0.007 s]
  Range (min … max):    1.070 s …  1.447 s    20 runs

Benchmark 3: rust/target/release/tq 40
  Time (mean ± σ):     574.3 ms ±  87.3 ms    [User: 1387.5 ms, System: 1.0 ms]
  Range (min … max):   434.1 ms … 742.2 ms    20 runs

Summary
  'rust/target/release/tq 40' ran
    1.37 ± 0.21 times faster than '_build/default/mp_queue.exe 40'
    2.08 ± 0.35 times faster than '_build/default/domain_queue.exe 40'

3. The third example contradicts my intuition, but to a smaller degree: Rust is the fastest, but MP is 25% faster than Domains.
MacOS 12.6, 2.4 GHz 8-Core Intel Core i9:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/domain_queue.exe 40' 'rust/target/release/tq 40'
Benchmark 1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     797.4 ms ±  13.2 ms    [User: 2225.6 ms, System: 9.8 ms]
  Range (min … max):   772.6 ms … 832.5 ms    20 runs

Benchmark 2: _build/default/domain_queue.exe 40
  Time (mean ± σ):      1.005 s ±  0.086 s    [User: 1.892 s, System: 0.011 s]
  Range (min … max):    0.887 s …  1.174 s    20 runs

Benchmark 3: rust/target/release/tq 40
  Time (mean ± σ):     419.4 ms ±  61.1 ms    [User: 1146.7 ms, System: 2.2 ms]
  Range (min … max):   331.2 ms … 553.1 ms    20 runs

Summary
  'rust/target/release/tq 40' ran
    1.90 ± 0.28 times faster than '_build/default/mp_queue.exe 40'
    2.40 ± 0.40 times faster than '_build/default/domain_queue.exe 40'

4. The final example looks closer to example #2; however, here we have a different CPU architecture and a different OS.
MacOS 12.6, Apple M1 Pro:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/domain_queue.exe 40' 'rust/target/release/tq 40'
Benchmark 1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     610.7 ms ±   1.3 ms    [User: 1698.7 ms, System: 6.8 ms]
  Range (min … max):   607.9 ms … 612.7 ms    20 runs

Benchmark 2: _build/default/domain_queue.exe 40
  Time (mean ± σ):     874.2 ms ±  60.2 ms    [User: 1714.2 ms, System: 5.7 ms]
  Range (min … max):   776.0 ms … 1025.0 ms    20 runs

Benchmark 3: rust/target/release/tq 40
  Time (mean ± σ):     469.0 ms ±  79.5 ms    [User: 1318.1 ms, System: 3.0 ms]
  Range (min … max):   381.5 ms … 740.2 ms    20 runs

Summary
  'rust/target/release/tq 40' ran
    1.30 ± 0.22 times faster than '_build/default/mp_queue.exe 40'
    1.86 ± 0.34 times faster than '_build/default/domain_queue.exe 40'

I don’t have any other hardware/OS to test on. But so far, neither MacOS vs. Linux nor x86_64 vs. aarch64 explains the difference in behaviour.


Two comments:

  • At first glance, the Rust and OCaml versions of the fib function differ (n <= 2 vs n < 2). That could explain why Rust is slower than expected.
  • Your workers never communicate with each other after the initial setup, so (at least for OCaml; I don’t know much about Rust) you pay the cost of synchronisation without any benefit. A few experiments with perf show that the two versions execute roughly the same number of instructions, but the domain-based one spends much more time doing nothing (for example, more than ten times the number of context switches compared to the multi-process version). The main surprise is that this does not seem to have any impact on WSL.

@vlaviron

At first glance, the Rust and OCaml version of the fib functions differ (n <= 2 vs n < 2). That could explain why rust is slower than expected.

Haha, of course it had to be an off-by-one! Good spot! Fixing this makes the Rust version the fastest; so at least the baseline now makes sense (I’ll try to re-run the benchmarks sometime tomorrow).

so (at least for OCaml, I don’t know much about Rust) you pay the cost of synchronisation without any benefit

What sort of synchronization do you have in mind? Each worker thread communicates with the main thread via its own unbounded channel, and sending to or receiving from a channel should be (I’d expect) a couple of compare-and-exchange operations. Code-wise, the synchronization points are not obvious to me.
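To make the synchronisation points concrete, here is a minimal sketch of an unbounded blocking channel built from a Mutex and a Condition (the channel in the actual benchmark may well be implemented differently, e.g. lock-free):

```ocaml
(* A minimal unbounded channel: senders never block; receivers block on a
   condition variable until an item is available. *)
type 'a chan = {
  q : 'a Queue.t;
  m : Mutex.t;
  c : Condition.t;
}

let make () =
  { q = Queue.create (); m = Mutex.create (); c = Condition.create () }

let send ch v =
  Mutex.lock ch.m;
  Queue.push v ch.q;
  Condition.signal ch.c;
  Mutex.unlock ch.m

let recv ch =
  Mutex.lock ch.m;
  while Queue.is_empty ch.q do Condition.wait ch.c ch.m done;
  let v = Queue.pop ch.q in
  Mutex.unlock ch.m;
  v
```

Even in this lock-based form, the only obvious synchronisation is per send/recv pair, which happens once per task rather than once per fib call.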

I believe that the multicore GC needs synchronisation between all the domains for a variety of events, for instance minor GC. I suspect that this (or something similar) could end up stalling your program regularly.

But I don’t have enough knowledge to give a more precise answer; in particular I don’t know if you’re hitting a bug in the multicore GC (stalling much more than necessary) or if this is expected.

I have a very simple method to assess if something is parallelized well:

  • fire up gkrellm (or htop) and check that all cores are busy

Load distribution looks pretty similar between MP and Domain versions:
(screenshot: per-core load, MP vs. Domain versions)

I’m a bit surprised that the Domain version has 2x the threads. Not sure if it’s the runtime or an artifact of htop.

Each domain has a backup thread to service GC interrupts while the main thread is in a blocking system call.


@gadmm thanks for the insight! To clarify, you’re saying that if I spawn 100 domains, the runtime will spawn 100 more threads to service GC that will be there for the whole duration of the program’s execution?

That would explain some extra context switches @vlaviron saw, but probably not the 10x difference…?

There will still be inter-thread synchronization even if the program is not taking advantage of the shared global heap. I could be wrong, but at a high level I would not expect performance gains from rewriting a multiprocessing program to use domains unless it also exploits the shared heap, for instance by replacing inter-process communication via marshalled values, a shared file system, or a shared database with simply passing a pointer.

I haven’t looked at any IR/asm dumps, but I’d guess that your fib function needs a polling point on each iteration. Since it has a very short body, it would not be surprising to me if that dominated. It would also be more informative to use a function that allocates instead of fib.
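For instance, an allocating variant could look like this sketch (fib_boxed is a hypothetical name, not anything from the benchmark): boxing the intermediate results means every call allocates on the minor heap, so each domain would reach GC synchronisation points far more often than with the unboxed fib.

```ocaml
(* Hedged sketch of an allocating fib: each call returns a heap-allocated
   ref cell, so the minor allocator (and its safepoints) is hit on every
   recursive step. *)
let rec fib_boxed n =
  if n <= 2 then ref 1
  else ref (!(fib_boxed (n - 1)) + !(fib_boxed (n - 2)))

let fib_alloc n = !(fib_boxed n)
```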


Good point @jjb!

I am not confident that performance gains should be expected if a multiprocessing program is rewritten to use domains without also taking advantage of the shared heap

This makes sense. I didn’t expect gains, and was expecting more or less parity; what was surprising to me is to see up to 50% slow down.

I haven’t looked at any IR/asm dumps, but guess that your fib function needs a polling point on each iteration. Since it has a very short body, it would not be surprising to me if that dominated

Do you mean that it might just be a pathological case for multicore?

I looked at the assembly of fib. The only thing that stands out is a call to _caml_call_realloc_stack, which calls _caml_try_realloc_stack, but I don’t see GC polls there. With that said, I’m not familiar with the runtime internals, so I might be missing something.


0000000100004f10 <_camlDune__exe__Domain_queue__fib_62>:
100004f10: 4c 8d 94 24 a8 fe ff ff      leaq    -344(%rsp), %r10
100004f18: 4d 3b 56 20                  cmpq    32(%r14), %r10
100004f1c: 72 45                        jb      0x100004f63 <_camlDune__exe__Domain_queue__fib_62+0x53>
100004f1e: 48 83 ec 18                  subq    $24, %rsp
100004f22: 48 83 f8 05                  cmpq    $5, %rax
100004f26: 7f 0c                        jg      0x100004f34 <_camlDune__exe__Domain_queue__fib_62+0x24>
100004f28: b8 03 00 00 00               movl    $3, %eax
100004f2d: 48 83 c4 18                  addq    $24, %rsp
100004f31: c3                           retq
100004f32: 66 90                        nop
100004f34: 48 89 04 24                  movq    %rax, (%rsp)
100004f38: 48 83 c0 fc                  addq    $-4, %rax
100004f3c: e8 cf ff ff ff               callq   0x100004f10 <_camlDune__exe__Domain_queue__fib_62>
100004f41: 48 89 44 24 08               movq    %rax, 8(%rsp)
100004f46: 48 8b 04 24                  movq    (%rsp), %rax
100004f4a: 48 83 c0 fe                  addq    $-2, %rax
100004f4e: e8 bd ff ff ff               callq   0x100004f10 <_camlDune__exe__Domain_queue__fib_62>
100004f53: 48 8b 5c 24 08               movq    8(%rsp), %rbx
100004f58: 48 01 d8                     addq    %rbx, %rax
100004f5b: 48 ff c8                     decq    %rax
100004f5e: 48 83 c4 18                  addq    $24, %rsp
100004f62: c3                           retq
100004f63: 6a 24                        pushq   $36
100004f65: e8 ba 06 07 00               callq   0x100075624 <_caml_call_realloc_stack>
100004f6a: 41 5a                        popq    %r10
100004f6c: eb b0                        jmp     0x100004f1e <_camlDune__exe__Domain_queue__fib_62+0xe>
100004f6e: 66 90                        nop

That might be the issue, actually. If a thread wants to start a minor collection, it has to wait until all other threads reach a poll point, and your fib function only reaches one when reallocating the stack.
Normally recursive functions must contain a poll point, but it’s possible that for non-tail recursion we omit the poll because we assume that unbounded recursion would quickly reach the stack limit and trigger reallocation. Your case shows that this is dangerous, because within a fixed stack depth you can still fit an arbitrary number of recursive calls, as long as you have a non-linear call graph.

Could you try replacing fib with the following implementation and report the results? Making one of the calls tail-recursive should force the compiler to insert a poll, ensuring more timely synchronisation.

let rec fib_aux acc n =
  if n < 2 then acc + 1
  else fib_aux (fib_aux acc (n-2)) (n-1)
let fib n = fib_aux 0 n

Then this is “There are not enough polls during non-tail recursive calls” (Issue #11580 · ocaml/ocaml · GitHub).


@vlaviron I can see a GC poll point added with a tail-rec fib impl:

0000000100004ee0 <_camlDune__exe__Domain_queue_tr__fib_aux_62>:
...
100004f1f: e8 9c 09 07 00               callq   0x1000758c0 <_caml_call_gc>
...

I don’t have access to all the machines right now, but preliminary results show that switching to the tail-rec fib helped the domains-based version (roughly a 20%–25% improvement) but slowed down the MP version. Below, _tr marks binaries with the tail-rec fib:

MacOS 12.6, Intel Core i9:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/mp_queue_tr.exe 40' '_build/default/domain_queue.exe 40' '_build/default/domain_queue_tr.exe 40'
Benchmark 1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     853.4 ms ±  10.7 ms    [User: 2373.7 ms, System: 13.5 ms]
  Range (min … max):   840.8 ms … 880.5 ms    20 runs

Benchmark 2: _build/default/mp_queue_tr.exe 40
  Time (mean ± σ):      1.016 s ±  0.017 s    [User: 2.826 s, System: 0.016 s]
  Range (min … max):    0.999 s …  1.062 s    20 runs

Benchmark 3: _build/default/domain_queue.exe 40
  Time (mean ± σ):      1.064 s ±  0.106 s    [User: 2.016 s, System: 0.014 s]
  Range (min … max):    0.935 s …  1.319 s    20 runs

Benchmark 4: _build/default/domain_queue_tr.exe 40
  Time (mean ± σ):     871.0 ms ±  20.1 ms    [User: 2414.3 ms, System: 11.7 ms]
  Range (min … max):   847.5 ms … 925.0 ms    20 runs

Summary
  '_build/default/mp_queue.exe 40' ran
    1.02 ± 0.03 times faster than '_build/default/domain_queue_tr.exe 40'
    1.19 ± 0.03 times faster than '_build/default/mp_queue_tr.exe 40'
    1.25 ± 0.12 times faster than '_build/default/domain_queue.exe 40'

Server Linux, Intel Xeon:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/mp_queue_tr.exe 40' '_build/default/domain_queue.exe 40' '_build/default/domain_queue_tr.exe 40'
Benchmark 1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     835.5 ms ±  17.2 ms    [User: 2336.6 ms, System: 6.1 ms]
  Range (min … max):   807.5 ms … 871.4 ms    20 runs

Benchmark 2: _build/default/mp_queue_tr.exe 40
  Time (mean ± σ):      1.188 s ±  0.032 s    [User: 3.327 s, System: 0.005 s]
  Range (min … max):    1.158 s …  1.296 s    20 runs

Benchmark 3: _build/default/domain_queue.exe 40
  Time (mean ± σ):      1.246 s ±  0.084 s    [User: 2.401 s, System: 0.006 s]
  Range (min … max):    1.124 s …  1.466 s    20 runs

Benchmark 4: _build/default/domain_queue_tr.exe 40
  Time (mean ± σ):      1.048 s ±  0.020 s    [User: 2.933 s, System: 0.004 s]
  Range (min … max):    0.994 s …  1.092 s    20 runs

Summary
  '_build/default/mp_queue.exe 40' ran
    1.25 ± 0.04 times faster than '_build/default/domain_queue_tr.exe 40'
    1.42 ± 0.05 times faster than '_build/default/mp_queue_tr.exe 40'
    1.49 ± 0.11 times faster than '_build/default/domain_queue.exe 40'

Ran on the two remaining machines. Switching to the tail-recursive version seems to have helped on the Intel machine, but on Apple Silicon it made things quite a bit slower.

WSL Linux, Intel i7-12700K:

hyperfine '_build/default/mp_queue.exe 40' '_build/default/mp_queue_tr.exe 40' '_build/default/domain_queue.exe 40' '_build/default/domain_queue_tr.exe 40'
Benchmark #1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     677.3 ms ±  15.6 ms    [User: 1.906 s, System: 0.002 s]
  Range (min … max):   656.1 ms … 706.2 ms    10 runs

Benchmark #2: _build/default/mp_queue_tr.exe 40
  Time (mean ± σ):     647.6 ms ±  17.7 ms    [User: 1.829 s, System: 0.002 s]
  Range (min … max):   616.6 ms … 677.7 ms    10 runs

Benchmark #3: _build/default/domain_queue.exe 40
  Time (mean ± σ):     744.2 ms ±  63.4 ms    [User: 1.492 s, System: 0.002 s]
  Range (min … max):   602.2 ms … 824.2 ms    10 runs

Benchmark #4: _build/default/domain_queue_tr.exe 40
  Time (mean ± σ):     621.6 ms ±  18.9 ms    [User: 1.728 s, System: 0.000 s]
  Range (min … max):   599.1 ms … 655.0 ms    10 runs

Summary
  '_build/default/domain_queue_tr.exe 40' ran
    1.04 ± 0.04 times faster than '_build/default/mp_queue_tr.exe 40'
    1.09 ± 0.04 times faster than '_build/default/mp_queue.exe 40'
    1.20 ± 0.11 times faster than '_build/default/domain_queue.exe 40'

MacOS 12.6, Apple M1 Pro:

hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/mp_queue_tr.exe 40' '_build/default/domain_queue.exe 40' '_build/default/domain_queue_tr.exe 40'
Benchmark 1: _build/default/mp_queue.exe 40
  Time (mean ± σ):     612.9 ms ±   0.9 ms    [User: 1699.4 ms, System: 8.9 ms]
  Range (min … max):   611.2 ms … 614.5 ms    20 runs

Benchmark 2: _build/default/mp_queue_tr.exe 40
  Time (mean ± σ):     900.3 ms ±   2.0 ms    [User: 2503.6 ms, System: 9.5 ms]
  Range (min … max):   896.7 ms … 904.2 ms    20 runs

Benchmark 3: _build/default/domain_queue.exe 40
  Time (mean ± σ):     896.6 ms ±  85.7 ms    [User: 1719.2 ms, System: 7.8 ms]
  Range (min … max):   802.6 ms … 1105.5 ms    20 runs

Benchmark 4: _build/default/domain_queue_tr.exe 40
  Time (mean ± σ):      1.041 s ±  0.002 s    [User: 2.900 s, System: 0.008 s]
  Range (min … max):    1.038 s …  1.043 s    20 runs

Summary
  '_build/default/mp_queue.exe 40' ran
    1.46 ± 0.14 times faster than '_build/default/domain_queue.exe 40'
    1.47 ± 0.00 times faster than '_build/default/mp_queue_tr.exe 40'
    1.70 ± 0.00 times faster than '_build/default/domain_queue_tr.exe 40'