I’ve been playing with OCaml 5.0-beta1 and Domains. As an exercise I implemented a very simple worker pool in a few different ways:
1. using multi-processing (MP): mp_queue.ml
2. using Domains: domain_queue.ml
3. using multi-threading in Rust: tq
I tried to translate (1) to (2) as directly as possible. The programs are not meant to be a good real-world benchmark, but rather something to help me build an intuition for multicore.
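For readers without the repo at hand, here is a minimal sketch of what a Domains-based worker pool can look like. It is not the code from domain_queue.ml: the names bqueue, bq_push, bq_pop, worker and run_pool are hypothetical, fib is only a placeholder CPU-bound task, and the mutex-plus-condition queue is an assumption about the design. It uses only the OCaml 5.0 standard library.

```ocaml
(* Hypothetical sketch of a Domain-based worker pool (OCaml >= 5.0).
   This is NOT the code from domain_queue.ml. *)

(* A minimal blocking queue: a Stdlib.Queue guarded by a mutex,
   with a condition variable to wake up waiting consumers. *)
type 'a bqueue = {
  items : 'a Queue.t;
  lock : Mutex.t;
  nonempty : Condition.t;
}

let bq_create () =
  { items = Queue.create ();
    lock = Mutex.create ();
    nonempty = Condition.create () }

let bq_push q x =
  Mutex.lock q.lock;
  Queue.push x q.items;
  Condition.signal q.nonempty;
  Mutex.unlock q.lock

let bq_pop q =
  Mutex.lock q.lock;
  (* Loop to guard against spurious wake-ups. *)
  while Queue.is_empty q.items do
    Condition.wait q.nonempty q.lock
  done;
  let x = Queue.pop q.items in
  Mutex.unlock q.lock;
  x

(* Placeholder CPU-bound task (an assumption, chosen for illustration). *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

(* Each worker pops inputs until it sees None, then returns its results. *)
let worker tasks () =
  let rec loop acc =
    match bq_pop tasks with
    | None -> acc
    | Some n -> loop ((n, fib n) :: acc)
  in
  loop []

let run_pool ~workers inputs =
  let tasks = bq_create () in
  let domains = List.init workers (fun _ -> Domain.spawn (worker tasks)) in
  List.iter (fun n -> bq_push tasks (Some n)) inputs;
  (* One stop marker per worker so that every domain terminates. *)
  for _i = 1 to workers do
    bq_push tasks None
  done;
  domains |> List.concat_map Domain.join |> List.sort compare
```

A call such as run_pool ~workers:3 [10; 20; 5] returns the (input, result) pairs sorted by input. In a sketch like this all workers contend on a single lock, whereas a multi-process version shares nothing between workers, so the two designs are not necessarily equivalent in cost.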
Anyhow, this is what I expected:
- tq to be the fastest,
- more or less parity between mp_queue and domain_queue.
Actual results were surprising and somewhat all over the place, but generally
- Rust’s version was ~2x faster than domains (which makes sense and acts as a baseline),
- Multi-processed version was not slower than domains-based and often faster (up to 50%).
I’d appreciate any input that would clarify what’s going on here.
The following commands are run from the repo root (requires dune build @all --profile=release from the root and cargo build --release from the rust/ folder).
UPDATE 1: Don’t mind the Rust’s numbers – they need to be redone.
UPDATE 2: I reran the benchmarks with a fixed Rust implementation and with the O3 flag for ocamlopt.
1. First example matches my intuition: Rust is the fastest and more-or-less parity between domains and MP.
WSL Linux, Intel i7-12700K:
hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/domain_queue.exe 40' 'rust/target/release/tq 40'
Benchmark #1: _build/default/mp_queue.exe 40
Time (mean ± σ): 687.9 ms ± 29.9 ms [User: 1.918 s, System: 0.001 s]
Range (min … max): 651.3 ms … 758.8 ms 20 runs
Benchmark #2: _build/default/domain_queue.exe 40
Time (mean ± σ): 737.5 ms ± 64.4 ms [User: 1.490 s, System: 0.006 s]
Range (min … max): 575.3 ms … 818.5 ms 20 runs
Benchmark #3: rust/target/release/tq 40
Time (mean ± σ): 328.3 ms ± 70.5 ms [User: 845.9 ms, System: 5.9 ms]
Range (min … max): 237.6 ms … 508.8 ms 20 runs
Summary
'rust/target/release/tq 40' ran
2.10 ± 0.46 times faster than '_build/default/mp_queue.exe 40'
2.25 ± 0.52 times faster than '_build/default/domain_queue.exe 40'
2. Second example contradicts my intuition: Rust’s baseline is still the same but the domains version is 50% slower than MP!
Server Linux, Intel Xeon
hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/domain_queue.exe 40' 'rust/target/release/tq 40'
Benchmark 1: _build/default/mp_queue.exe 40
Time (mean ± σ): 788.7 ms ± 20.2 ms [User: 2203.7 ms, System: 5.0 ms]
Range (min … max): 767.7 ms … 838.7 ms 20 runs
Benchmark 2: _build/default/domain_queue.exe 40
Time (mean ± σ): 1.192 s ± 0.087 s [User: 2.312 s, System: 0.007 s]
Range (min … max): 1.070 s … 1.447 s 20 runs
Benchmark 3: rust/target/release/tq 40
Time (mean ± σ): 574.3 ms ± 87.3 ms [User: 1387.5 ms, System: 1.0 ms]
Range (min … max): 434.1 ms … 742.2 ms 20 runs
Summary
'rust/target/release/tq 40' ran
1.37 ± 0.21 times faster than '_build/default/mp_queue.exe 40'
2.08 ± 0.35 times faster than '_build/default/domain_queue.exe 40'
3. Third example contradicts my intuition but to a smaller degree: Rust is the fastest but MP is 25% faster than Domains.
MacOS 12.6, 2.4 GHz 8-Core Intel Core i9
hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/domain_queue.exe 40' 'rust/target/release/tq 40'
Benchmark 1: _build/default/mp_queue.exe 40
Time (mean ± σ): 797.4 ms ± 13.2 ms [User: 2225.6 ms, System: 9.8 ms]
Range (min … max): 772.6 ms … 832.5 ms 20 runs
Benchmark 2: _build/default/domain_queue.exe 40
Time (mean ± σ): 1.005 s ± 0.086 s [User: 1.892 s, System: 0.011 s]
Range (min … max): 0.887 s … 1.174 s 20 runs
Benchmark 3: rust/target/release/tq 40
Time (mean ± σ): 419.4 ms ± 61.1 ms [User: 1146.7 ms, System: 2.2 ms]
Range (min … max): 331.2 ms … 553.1 ms 20 runs
Summary
'rust/target/release/tq 40' ran
1.90 ± 0.28 times faster than '_build/default/mp_queue.exe 40'
2.40 ± 0.40 times faster than '_build/default/domain_queue.exe 40'
4. Final example looks closer to example #2; however, here we have a different CPU architecture and a different OS.
MacOS 12.6, Apple M1 Pro:
hyperfine -w 5 -r 20 '_build/default/mp_queue.exe 40' '_build/default/domain_queue.exe 40' 'rust/target/release/tq 40'
Benchmark 1: _build/default/mp_queue.exe 40
Time (mean ± σ): 610.7 ms ± 1.3 ms [User: 1698.7 ms, System: 6.8 ms]
Range (min … max): 607.9 ms … 612.7 ms 20 runs
Benchmark 2: _build/default/domain_queue.exe 40
Time (mean ± σ): 874.2 ms ± 60.2 ms [User: 1714.2 ms, System: 5.7 ms]
Range (min … max): 776.0 ms … 1025.0 ms 20 runs
Benchmark 3: rust/target/release/tq 40
Time (mean ± σ): 469.0 ms ± 79.5 ms [User: 1318.1 ms, System: 3.0 ms]
Range (min … max): 381.5 ms … 740.2 ms 20 runs
Summary
'rust/target/release/tq 40' ran
1.30 ± 0.22 times faster than '_build/default/mp_queue.exe 40'
1.86 ± 0.34 times faster than '_build/default/domain_queue.exe 40'
I don’t have any other hardware/OS to test on. But so far, neither macOS vs Linux nor x86_64 vs aarch64 explains the difference in behaviour.
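To put the four runs side by side, here is a small sketch that computes how many times longer domain_queue takes than mp_queue on each machine. The mean wall-clock times are copied from the hyperfine output above; the names runs and slowdown are just for illustration.

```ocaml
(* Mean wall-clock times in ms, copied from the hyperfine output above:
   (machine, mp_queue mean, domain_queue mean). *)
let runs = [
  ("WSL Linux, i7-12700K", 687.9, 737.5);
  ("Server Linux, Xeon", 788.7, 1192.0);
  ("macOS, Intel Core i9", 797.4, 1005.0);
  ("macOS, Apple M1 Pro", 610.7, 874.2);
]

(* How many times longer domain_queue takes than mp_queue. *)
let slowdown (_, mp, dom) = dom /. mp

let () =
  List.iter
    (fun ((name, _, _) as run) ->
      Printf.printf "%-22s domain_queue / mp_queue = %.2f\n"
        name (slowdown run))
    runs
```

The ratios come out at roughly 1.07, 1.51, 1.26 and 1.43, so the gap ranges from near parity on the i7-12700K to about 50% on the Xeon.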