Significant performance difference between OCaml and F#

Someone on Hacker News posted a performance comparison between F# and OCaml on some synthetic benchmarks. I thought some of the results might have been overlooked, so I decided to post them here:

pidigits

source    seconds    memory     gz     busy    cpu load
F#           1.87    35,692     905    2.01    4% 2% 4% 98%
OCaml       10.96    19,464     458   11.79    4% 1% 100% 2%

k-nucleotide

source    seconds    memory     gz     busy    cpu load
F#           5.80   183,360    1907   19.47    86% 66% 88% 96%
OCaml       21.32   255,548    1833   59.79    91% 48% 100% 42%

spectral-norm

source    seconds    memory     gz     busy    cpu load
F#           4.29    35,228     853   16.12    92% 92% 97% 94%
OCaml       15.70     3,704     377   16.86    100% 3% 5% 0%

fasta

source    seconds    memory     gz     busy    cpu load
F#           1.70   138,596    1350    6.45    95% 98% 93% 93%
OCaml        6.00   199,124    1189    6.32    100% 1% 1% 4%

regex-redux

source    seconds    memory     gz     busy    cpu load
F#           7.87   971,632     599   11.18    64% 37% 28% 13%
OCaml       25.23   901,944     637   27.09    4% 1% 100% 3%

mandelbrot

source    seconds    memory     gz     busy    cpu load
F#           5.78    65,644     933   22.72    99% 98% 98% 98%
OCaml       14.02     5,048     717   55.89    99% 100% 100% 100%

binary-trees

source    seconds    memory     gz     busy    cpu load
F#           6.28   724,124     635   22.76    91% 93% 90% 88%
OCaml        9.89   152,992     751   28.64    49% 64% 98% 79%

reverse-complement

source    seconds    memory      gz     busy    cpu load
F#           2.93   1,031,696   1140    8.52    94% 45% 60% 90%
OCaml        3.74      33,944   1368    8.90    62% 58% 60% 59%

fannkuch-redux

source    seconds    memory     gz     busy    cpu load
F#          16.68    35,816    1097   65.52    98% 97% 100% 98%
OCaml       16.37    19,584    1004   65.39    100% 100% 100% 100%

n-body

source    seconds    memory     gz     busy    cpu load
F#          22.20    33,660    1290   23.03    0% 100% 1% 3%
OCaml       21.67     1,368    1251   23.26    3% 1% 100% 4%

Used software

F#: .NET Core SDK 3.1.201
Host Version: 3.1.3; Commit: 4a9f85e9f8
<ServerGarbageCollection>true</ServerGarbageCollection>
<ConcurrentGarbageCollection>true</ConcurrentGarbageCollection>

OCaml

The OCaml native-code compiler, version 4.10.0

https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/fsharp.html

2 Likes

In most of the cases with a large difference, the F# version is parallelized (or uses external C libraries) and is being compared to a single-process OCaml version. Unsurprisingly, different algorithms yield different performance characteristics.

2 Likes

That’s only true for three (spectral-norm, fasta, regex-redux) of the seven benchmarks with a significant performance difference. For the k-nucleotide, mandelbrot, and binary-trees benchmarks, both implementations are parallelized, but F# is able to make better use of that parallelism.

For the pidigits benchmark, the F# code uses GMP through its FFI and the OCaml code uses it through Zarith. I’m a little surprised by the performance difference in this one, since the implementations seem quite similar.

1 Like

The pidigits comparison has come up before as well. I get similar numbers to what’s shared in the other post, which suggests that there could’ve been an issue with how the benchmark was run in the original post.

It’s worth pointing out that there are multicore versions of most of these benchmarks available as part of the Sandmark benchmarking infrastructure if someone fancies seeing how they compare with the F# ones: https://github.com/ocaml-bench/sandmark/tree/master/benchmarks (You may need to do a little rummaging)

One caveat is that these benchmarks were written for the purpose of comparing OCaml runtimes to OCaml runtimes rather than other language implementations - so there might still be performance left on the table.

I think that it may require quite a bit more work to get “better performance” on Shootout benchmarks by using the Multicore runtime. I looked at it a few weeks back, and my conclusions are that:

  • For many parallel benchmarks, a solution that reduces cross-thread communication to a bare minimum has already been proposed in some language, and this solution can be ported to OCaml efficiently using Unix.fork. A Multicore version will typically not be faster than a fork-based version, because it pays the cost of a concurrent runtime.

  • There are other benchmarks that require delicate inter-thread communication and scheduling, and thus cannot be implemented easily with Unix.fork. There I would expect Multicore to be better, a substantial improvement over the sequential OCaml versions. But such fine-grained concurrency may also benefit a lot from a good concurrent scheduler (typically some kind of work-stealing scheduler), as available in the battle-tested concurrent runtimes of other languages (e.g. .NET, the JVM, or Haskell). If I understand correctly, this part has not been a focus of recent Multicore development, so it’s not there yet. Thus we can expect Multicore not to do quite as well as those runtimes on this class of examples.
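As an illustration of the first point, the fork-based pattern used by many shootout solutions can be sketched as follows. This is a minimal sketch, not an actual submission: parallel_sum and the one-pipe-per-child protocol are illustrative.

```ocaml
(* Split an embarrassingly parallel loop across processes with Unix.fork.
   Each child computes a chunk and marshals its partial result through a
   pipe; cross-process communication happens exactly once per worker.
   Build with: ocamlfind ocamlopt -package unix -linkpkg par.ml *)
let parallel_sum nworkers f n =
  let chunk = n / nworkers in
  let children =
    List.init nworkers (fun i ->
        let lo = i * chunk in
        let hi = if i = nworkers - 1 then n else lo + chunk in
        let rd, wr = Unix.pipe () in
        match Unix.fork () with
        | 0 ->
            (* child: compute a partial sum, send it back, exit *)
            Unix.close rd;
            let acc = ref 0.0 in
            for j = lo to hi - 1 do acc := !acc +. f j done;
            let oc = Unix.out_channel_of_descr wr in
            output_value oc !acc;
            close_out oc;
            exit 0
        | pid ->
            (* parent: keep only the read end *)
            Unix.close wr;
            (pid, rd))
  in
  List.fold_left
    (fun total (pid, rd) ->
      let ic = Unix.in_channel_of_descr rd in
      let part : float = input_value ic in
      close_in ic;
      ignore (Unix.waitpid [] pid);
      total +. part)
    0.0 children

let () =
  Printf.printf "%.0f\n" (parallel_sum 4 (fun i -> float_of_int i) 1_000_000)
```

Since each worker runs in its own process, there is no shared runtime to pay for; the trade-off is that anything more fine-grained than one round-trip per worker gets awkward fast.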

I would recommend playing with Multicore OCaml and the shootout benchmarks; it’s a lot of fun and you learn a lot (I played with binary-trees). But I would not necessarily expect to beat our fine competitors on these benchmarks today.

7 Likes

The implementation is completely different. The F# code directly uses the C interface, while the OCaml code uses Zarith, which is a wrapper around it. In particular, the F# code gives GMP numbers an “entity” semantics, while Zarith gives them a “value” semantics. This makes a huge difference in memory usage: the OCaml code allocates like crazy, while the F# code works in place with perfect cache locality.

Compare the following two pieces of code. OCaml does (nth * num + acc), while F# does

mpzMulUi(&tmp1, &num, nth)
mpzAdd(&tmp2, &tmp1, &acc)

Sure, the OCaml code is much more readable, but the F# code controls where the numbers are stored. (More precisely, GMP controls where they are stored, but the F# code tells GMP that the buffers associated with tmp1 and tmp2 can be reused. There is no need to allocate new buffers, contrary to Zarith.)

2 Likes

Just to be sure, I did experiment a bit. With respect to last-level cache misses, there are four orders of magnitude between the Zarith wrapper and direct access to the GMP primitives: one suffers 200M cache misses, while the other only gets 17k. Another relevant data point is that one triggers 900 full compactions of the garbage collector, while the other does not trigger even a single minor collection. So there is no way the OCaml code would ever be competitive with the implementations for other languages. In fact, I am surprised that it is that fast, given the awful memory locality.

(Just to be on the safe side, I also added explicit flushes to the OCaml code, so that both programs have the same behavior with respect to I/O buffers. The results with respect to cache locality are almost the same.)
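For reference, the collection and compaction counts quoted above can be read straight off the OCaml runtime (the cache-miss numbers need an external profiler such as perf); a minimal sketch:

```ocaml
(* Print the GC counters after a run; "compactions" is the
   "full compactions" figure mentioned above. *)
let () =
  (* ... run the benchmark workload here ... *)
  let s = Gc.stat () in
  Printf.printf "minor collections: %d\nmajor collections: %d\ncompactions: %d\n"
    s.Gc.minor_collections s.Gc.major_collections s.Gc.compactions
```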

6 Likes

Regarding pidigits, this analysis suggests two directions for improvement:

  • implement fused operations (e.g. multiply-add) in Zarith (and other pure interfaces) to save on intermediate allocations; specific benchmarks could get a good boost from that
  • implement a simple low-level interface that exposes GMP’s in-place mutation operations instead of the pure interface.

My understanding is that Zarith is not designed for speed on large-number benchmarks; it is designed to do well on programs that mostly manipulate machine-size numbers but may go above that, with fast conversion to and from OCaml integers when the sizes allow (in the fast path, it’s a no-op), as a look at the implementation confirms.
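That design is visible at the API level: values that fit in a native int stay unboxed, which Obj.is_int can (unofficially) reveal. A small sketch, assuming the zarith package; the Obj checks poke at an implementation detail and are for illustration only:

```ocaml
(* Small Zarith values are plain OCaml integers: no GMP call, no heap
   allocation on the fast path. *)
let () =
  let a = Z.of_int 21 and b = Z.of_int 2 in
  assert (Z.to_int Z.(a * b) = 42);
  (* implementation detail: a small Z.t is an immediate (unboxed) value *)
  assert (Obj.is_int (Obj.repr a));
  (* a value that overflows a native int gets a boxed GMP representation *)
  let big = Z.(pow (of_int 2) 200) in
  assert (not (Obj.is_int (Obj.repr big)))
```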

My impression is that it would be a substantial amount of work to add a mutation-in-place interface to Zarith, whereas adapting simpler C bindings (without the no-op conversions to and from OCaml integers), such as gmp-ocaml, could be much easier. (One could even build the pure interface on top of the in-place interface to avoid code duplication.)

2 Likes

As far as I can tell, there already exist some shallow bindings to GMP:

Documentation: https://www.inrialpes.fr/pop-art/people/bjeannet/mlxxxidl-forge/mlgmpidl/html/Mpz.html

1 Like

Ah, so this is “just” a matter of rewriting the benchmark to use this library. Any takers?

(I thought that the Shootout website did not allow external libraries in benchmarks, but Zarith was allowed; I don’t know whether a solution using mlgmpidl would be accepted. Worth a try anyway!)

I never learned to use CamlIDL myself, but it does look quite convenient for this particular library (little code, not much boilerplate, to describe a low-level binding). I guess nowadays people would recommend Ctypes instead, which is fine as well.

Edit: in fact there is also a Ctypes binding to the low-level GMP functions (one that exposes the imperative style); it is not gmp-ocaml, which I linked to earlier, but ocaml-gmp (see the Z_raw module in gmp.ml, unfortunately not currently exposed to the outside).
It sounds like there are a few too many GMP bindings out there…
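For a sense of scale, a shallow Ctypes binding really is only a couple of lines per function. A hypothetical sketch (not taken from any of the libraries above): it treats mpz_t as an opaque pointer for simplicity, and uses GMP’s real exported symbol names, which are prefixed __gmpz_ behind the mpz_ macros in gmp.h.

```ocaml
(* Hypothetical shallow GMP binding via Ctypes; requires the
   ctypes-foreign package and linking against libgmp. *)
open Ctypes
open Foreign

(* treat mpz_t as an opaque pointer for the purposes of this sketch *)
let mpz_t = ptr void

(* void mpz_add (mpz_t rop, const mpz_t op1, const mpz_t op2) *)
let mpz_add =
  foreign "__gmpz_add" (mpz_t @-> mpz_t @-> mpz_t @-> returning void)

(* void mpz_mul_si (mpz_t rop, const mpz_t op1, long op2) *)
let mpz_mul_si =
  foreign "__gmpz_mul_si" (mpz_t @-> mpz_t @-> long @-> returning void)
```

A real binding would also wrap mpz_init/mpz_clear and attach a finalizer, but the per-function cost stays about this small.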

2 Likes

Curiously, in five tests Haskell (GHC) is significantly faster than OCaml too:

mandelbrot

source          secs    mem       gz     busy    cpu load
OCaml          14.02    5,048     717   55.89    99% 100% 100% 100%
Haskell GHC     5.06   37,660    1975   20.09    98% 100% 100% 100%

spectral-norm

source          secs    mem       gz     busy    cpu load
OCaml          15.70    3,704     377   16.86    100% 3% 5% 0%
Haskell GHC     4.09    4,184     987   16.03    98% 99% 99% 97%

fasta

source          secs    mem       gz     busy    cpu load
OCaml           6.00   199,124   1189    6.32    100% 1% 1% 4%
Haskell GHC     1.40     8,020   1882    4.16    74% 74% 74% 75%

pidigits

The Haskell implementation here uses GMP as well, though.

source          secs    mem       gz     busy    cpu load
OCaml          10.96   19,464     458   11.79    4% 1% 100% 2%
Haskell GHC     1.76    6,516    1694    1.79    0% 1% 1% 99%

regex-redux

The Haskell implementation uses the PCRE2 library, though.

source          secs    mem       gz     busy    cpu load
OCaml          25.23   901,944    637   27.09    4% 1% 100% 3%
Haskell GHC     1.68   308,212   2213    3.92    46% 77% 44% 66%

Regarding pidigits, the reason is the same as for F#. The Haskell code directly calls the GMP functions, so cache locality is perfect, contrary to OCaml/Zarith. This makes the code completely unidiomatic, since GMP functions are inherently effectful; the Haskell code thus uses seq all over the place to make sure computations are executed in the correct order.

2 Likes

These benchmark results can mostly be explained by parallel implementations in Haskell versus sequential ones in OCaml, and by differences in external libraries: GMP and PCRE2. The exception seems to be mandelbrot.

Pidigits: I rewrote OCaml solution #5 to use the low-level operations exposed by the mlgmpidl library.

The running time goes from 5.6s to 1.4s on my machine.

Diff:

-let big_3  = Z.of_int 3
-let big_4  = Z.of_int 4
-let big_10 = Z.of_int 10

-let init = (Z.one, Z.one, Z.zero)
+let init = (Mpz.of_int 1, Mpz.of_int 1, Mpz.of_int 0)

-let extract (num,den,acc) nth = Z.((nth * num + acc) / den |> to_int)
+let extract =
+  let res : Mpz.m Mpz.tt = Mpz.init () in
+  fun (num,den,acc) nth ->
+    (* (nth * num + acc) / den |> to_int *)
+    Mpz.mul_si res num nth;
+    Mpz.add res res acc;
+    Mpz.tdiv_q res res den;
+    Mpz.get_int res

-let next z = extract z big_3
+let next z = extract z 3

-let safe z n = extract z big_4 = n
+let safe z n = extract z 4 = n

-let prod (num,den,acc) d =
-  Z.(big_10 * num,
-     den,
-     big_10 * (acc - den * of_int d))
+let prod =
+  let tmp : Mpz.m Mpz.tt = Mpz.init () in
+  fun (res_num, res_den, res_acc) (num,den,acc) d ->
+    (* (10 * num, den, 10 * (acc - den * of_int d)) *)
+    Mpz.mul_si res_num num 10;
+
+    Mpz.set res_den den;
+
+    Mpz.mul_si tmp den d;
+    Mpz.sub res_acc acc tmp;
+    Mpz.mul_si res_acc res_acc 10;
+
+    ()

-let cons (num,den,acc) k =
-  let k2 = Z.of_int (k * 2 + 1) in
-  Z.(of_int k * num,
-     k2 * den,
-     k2 * (acc + num + num))
+let cons =
+  fun (res_num, res_den, res_acc) (num,den,acc) k ->
+    let k2 = k * 2 + 1 in
+    (*  (k * num, k2 * den,  k2 * (acc + num + num)) *)
+    Mpz.mul_si res_den den k2;
+
+    Mpz.add res_acc acc num;
+    Mpz.add res_acc res_acc num;
+    Mpz.mul_si res_acc res_acc k2;
+
+    (* must go below the computation of res_acc,
+       to avoid trashing the 'num' value it uses *)
+    Mpz.mul_si res_num num k;
+    
+    ()
 
 let columns = 10
 
@@ -39,12 +60,17 @@
     if col = columns then begin
       let row = row + col in
       Printf.printf "\t:%i\n%i" row d;
-      digit k (prod z d) (n-1) row 1
+      prod z z d;
+      digit k z (n-1) row 1
     end else begin 
       print_int d;
-      digit k (prod z d) (n-1) row (col+1)
+      prod z z d;
+      digit k z (n-1) row (col+1)
     end
-  else digit (k+1) (cons z k) n row col
+  else begin
+    cons z z k;
+    digit (k+1) z n row col
+  end
 
 let digits n = digit 1 init n 0 0

I will see if I can submit this on the Shootout website. I don’t know what the rules are for depending on third-party packages. (Edit: submitted as https://salsa.debian.org/benchmarksgame-team/benchmarksgame/-/issues/335.)

6 Likes

Your statement Mpz.tdiv_q res res den forces GMP to allocate fresh transient buffers rather than reusing existing long-lived ones. Notice how the C code (and all the other programs based on this algorithm) explicitly uses two buffers for this function: mpz_tdiv_q(tmp1, tmp2, den).

In extract, you can replace Mpz.mul_si res num nth; Mpz.add res res acc with Mpz.set res acc; Mpz.addmul_ui res num nth to avoid trashing the cache with an extra pass over the buffers.

The same issue exists with Mpz.mul_si tmp den d; Mpz.sub res_acc acc tmp. Unfortunately, since you have lost the information that res_acc is always acc, you cannot replace it by Mpz.submul_ui acc den d.

EDIT: To be clear, mpz_tdiv_q(tmp1, tmp2, den) needs a transient buffer too. But when written that way, GMP can avoid wasting time copying between buffers.

EDIT2: You could replace Mpz.mul_si tmp den d; Mpz.sub res_acc acc tmp with Mpz.set res_acc acc; Mpz.submul_ui res_acc den d. That would still perform a useless inplace copy, but at least that would no longer trash the cache.

1 Like

I used the following diff based on your advice:

 let extract =
+  let tmp : Mpz.m Mpz.tt = Mpz.init () in
   let res : Mpz.m Mpz.tt = Mpz.init () in
   fun (num,den,acc) nth ->
     (* (nth * num + acc) / den |> to_int *)
-    Mpz.mul_si res num nth;
-    Mpz.add res res acc;
-    Mpz.tdiv_q res res den;
+    Mpz.set tmp acc;
+    Mpz.addmul_ui tmp num nth;
+    Mpz.tdiv_q res tmp den;
     Mpz.get_int res
 
[...]

 let prod =
-  let tmp : Mpz.m Mpz.tt = Mpz.init () in
   fun (res_num, res_den, res_acc) (num,den,acc) d ->
     (* (10 * num, den, 10 * (acc - den * of_int d)) *)
     Mpz.mul_si res_num num 10;
 
-    Mpz.set res_den den;
+    if res_den != den then Mpz.set res_den den;
 
-    Mpz.mul_si tmp den d;
-    Mpz.sub res_acc acc tmp;
+    if res_acc != acc then Mpz.set res_acc acc;
+    Mpz.submul_ui res_acc den d;
     Mpz.mul_si res_acc res_acc 10;
 
     ()

Without the aliasing optimization (if res_foo != foo), the performance of the new version is essentially the same as that of my version on my machine, despite the lower instruction count. With the aliasing optimization, there is a very small improvement in performance, typically 1.35s instead of 1.4s.
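(For readers less familiar with OCaml: the aliasing guard is cheap because ( != ) is physical inequality, a single pointer comparison, with no traversal of the bignum. A tiny illustration:)

```ocaml
(* ( == ) / ( != ) compare identity; ( = ) compares structure. *)
let () =
  let a = ref 0 in
  let alias = a in
  assert (alias == a);   (* same block: physically equal *)
  let copy = ref 0 in
  assert (copy != a);    (* structurally equal, but a distinct block *)
  assert (copy = a)      (* structural equality still holds *)
```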

On the other hand, a further optimization provided a larger speedup:

 let extract =
-  let tmp : Mpz.m Mpz.tt = Mpz.init () in
-  let res : Mpz.m Mpz.tt = Mpz.init () in
+  let tmp1 : Mpz.m Mpz.tt = Mpz.init () in
+  let tmp2 : Mpz.m Mpz.tt = Mpz.init () in
   fun (num,den,acc) nth ->
     (* (nth * num + acc) / den |> to_int *)
-    Mpz.set tmp acc;
-    Mpz.addmul_ui tmp num nth;
-    Mpz.tdiv_q res tmp den;
-    Mpz.get_int res
+    Mpz.mul_si tmp1 num nth;
+    Mpz.add tmp2 acc tmp1;
+    Mpz.tdiv_q tmp1 tmp2 den;
+    Mpz.get_int tmp1

This goes from 1.35s to 1.25s.

Final version:

let extract =
  let tmp1 : Mpz.m Mpz.tt = Mpz.init () in
  let tmp2 : Mpz.m Mpz.tt = Mpz.init () in
  fun (num,den,acc) nth ->
    (* (nth * num + acc) / den |> to_int *)
    Mpz.mul_si tmp1 num nth;
    Mpz.add tmp2 acc tmp1;
    Mpz.tdiv_q tmp1 tmp2 den;
    Mpz.get_int tmp1

let prod =
  fun (res_num, res_den, res_acc) (num,den,acc) d ->
    (* (10 * num, den, 10 * (acc - den * of_int d)) *)
    Mpz.mul_si res_num num 10;

    if res_den != den then Mpz.set res_den den;

    if res_acc != acc then Mpz.set res_acc acc;
    Mpz.submul_ui res_acc den d;
    Mpz.mul_si res_acc res_acc 10;

    ()

let cons =
  fun (res_num, res_den, res_acc) (num,den,acc) k ->
    let k2 = k * 2 + 1 in
    (*  (k * num, k2 * den,  k2 * (acc + num + num)) *)
    Mpz.mul_si res_den den k2;

    Mpz.add res_acc acc num;
    Mpz.add res_acc res_acc num;
    Mpz.mul_si res_acc res_acc k2;

3 Likes

My first submission is already online on the benchmark website. On the benchmark machine it is measured at 2.88s, against 1.87s for the F# version (which seems to be doing some low-level byte-shifting stuff).

2 Likes

The number of last-level cache misses for the OCaml version is now close to zero, great. However, the number of first-level cache misses is still 50% too high. In fact, the number of memory accesses themselves is 50% too high. We can recover some performance by replacing

    Mpz.add res_acc acc num;
    Mpz.add res_acc res_acc num;

with

    if res_acc != acc then Mpz.set res_acc acc;
    Mpz.addmul_ui res_acc num 2;

But this is still nowhere near the optimal memory access patterns. I am really confused as to what might be causing this huge overhead.

Do you mean the incTotal function? I would not worry about that. I very much doubt I/O is the bottleneck here.

I finally found what was wrong. The OCaml code does not implement the correct algorithm; or rather, an important part of the algorithm is missing. The digit function should start as follows:

let rec digit k z n row col =
  if n = 0 then Printf.printf "%*s\t:%i\n%!" (columns-col) "" (row+col) else
  if let (num, _, acc) = z in Mpz.cmp num acc > 0 then begin
      cons z z k;
      digit (k + 1) z n row col
    end else
  ...

Now the OCaml version is competitive with the C version.

10 Likes