Significant performance difference between OCaml and F#

Someone on Hacker News posted a performance comparison between F# and OCaml on some synthetic benchmarks. I thought some of the results might have been overlooked, so I decided to post them here:

pidigits

source    seconds    memory     gz     busy    cpu load
F#           1.87    35,692     905    2.01    4% 2% 4% 98%
OCaml       10.96    19,464     458   11.79    4% 1% 100% 2%

k-nucleotide

source    seconds    memory     gz     busy    cpu load
F#           5.80   183,360    1907   19.47    86% 66% 88% 96%
OCaml       21.32   255,548    1833   59.79    91% 48% 100% 42%

spectral-norm

source    seconds    memory     gz     busy    cpu load
F#           4.29    35,228     853   16.12    92% 92% 97% 94%
OCaml       15.70     3,704     377   16.86    100% 3% 5% 0%

fasta

source    seconds    memory     gz     busy    cpu load
F#           1.70   138,596    1350    6.45    95% 98% 93% 93%
OCaml        6.00   199,124    1189    6.32    100% 1% 1% 4%

regex-redux

source    seconds    memory     gz     busy    cpu load
F#           7.87   971,632     599   11.18    64% 37% 28% 13%
OCaml       25.23   901,944     637   27.09    4% 1% 100% 3%

mandelbrot

source    seconds    memory     gz     busy    cpu load
F#           5.78    65,644     933   22.72    99% 98% 98% 98%
OCaml       14.02     5,048     717   55.89    99% 100% 100% 100%

binary-trees

source    seconds    memory     gz     busy    cpu load
F#           6.28   724,124     635   22.76    91% 93% 90% 88%
OCaml        9.89   152,992     751   28.64    49% 64% 98% 79%

reverse-complement

source    seconds    memory      gz     busy    cpu load
F#           2.93   1,031,696   1140    8.52    94% 45% 60% 90%
OCaml        3.74      33,944   1368    8.90    62% 58% 60% 59%

fannkuch-redux

source    seconds    memory     gz     busy    cpu load
F#          16.68    35,816    1097   65.52    98% 97% 100% 98%
OCaml       16.37    19,584    1004   65.39    100% 100% 100% 100%

n-body

source    seconds    memory     gz     busy    cpu load
F#          22.20    33,660    1290   23.03    0% 100% 1% 3%
OCaml       21.67     1,368    1251   23.26    3% 1% 100% 4%

Used software

F#: .NET Core SDK 3.1.201
Host Version: 3.1.3; Commit: 4a9f85e9f8
<ServerGarbageCollection>true</ServerGarbageCollection>
<ConcurrentGarbageCollection>true</ConcurrentGarbageCollection>

OCaml

The OCaml native-code compiler, version 4.10.0

https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/fsharp.html

2 Likes

In most of the cases with a large difference, the F# version is parallelized (or uses external C libraries) and is being compared to a single-process OCaml version. Unsurprisingly, different algorithms yield different performance characteristics.

2 Likes

That’s only true for three (spectral-norm, fasta, regex-redux) of the seven benchmarks with a significant performance difference. For the k-nucleotide, mandelbrot, and binary-trees benchmarks, both implementations are parallelized, but F# is able to make better use of that parallelism.

For the pidigits benchmark, the F# code uses GMP through its FFI and the OCaml code uses it through Zarith. I’m a little surprised by the performance difference in this one, since the implementations seem quite similar.

1 Like

The pidigits comparison has come up before as well. I get similar numbers to what’s shared in the other post, which suggests that there could’ve been an issue with how the benchmark was run in the original post.

It’s worth pointing out that there are multicore versions of most of these benchmarks available as part of the Sandmark benchmarking infrastructure if someone fancies seeing how they compare with the F# ones: https://github.com/ocaml-bench/sandmark/tree/master/benchmarks (You may need to do a little rummaging)

One caveat is that these benchmarks were written for the purpose of comparing OCaml runtimes to OCaml runtimes rather than other language implementations - so there might still be performance left on the table.

I think that it may require quite a bit more work to get “better performance” on Shootout benchmarks by using the Multicore runtime. I looked at it a few weeks back, and my conclusions are that:

  • For many parallel benchmarks, a solution that reduces cross-thread communication to a bare minimum has already been proposed in some language, and this solution can be ported to OCaml efficiently using Unix.fork. A Multicore version will typically not be faster than a fork-based version, because it pays the cost of a concurrent runtime.

  • There are other benchmarks that require delicate inter-thread communication and scheduling, and thus cannot be implemented easily with Unix.fork. There I would expect Multicore to be better, a substantial improvement over the sequential OCaml versions. But such fine-grained concurrency may also benefit a lot from a good concurrent scheduler (typically some kind of work-stealing scheduler), as available in the battle-tested concurrent runtimes of other languages (e.g. .NET, the JVM, or Haskell). If I understand correctly, this part has not been a focus of recent Multicore development, so it’s not there yet. Thus we can expect Multicore not to do quite as well as those runtimes on this class of examples.
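As an illustration of the first point, the fork-based pattern used by many shootout solutions can be sketched as follows. This is a minimal sketch, not an actual submission: parallel_sum and the one-pipe-per-child protocol are illustrative.

```ocaml
(* Split an embarrassingly parallel loop across processes with Unix.fork.
   Each child computes a chunk and marshals its partial result through a
   pipe; cross-process communication happens exactly once per worker.
   Build with: ocamlfind ocamlopt -package unix -linkpkg par.ml *)
let parallel_sum nworkers f n =
  let chunk = n / nworkers in
  let children =
    List.init nworkers (fun i ->
        let lo = i * chunk in
        let hi = if i = nworkers - 1 then n else lo + chunk in
        let rd, wr = Unix.pipe () in
        match Unix.fork () with
        | 0 ->
            (* child: compute a partial sum, send it back, exit *)
            Unix.close rd;
            let acc = ref 0.0 in
            for j = lo to hi - 1 do acc := !acc +. f j done;
            let oc = Unix.out_channel_of_descr wr in
            output_value oc !acc;
            close_out oc;
            exit 0
        | pid ->
            (* parent: keep only the read end *)
            Unix.close wr;
            (pid, rd))
  in
  List.fold_left
    (fun total (pid, rd) ->
      let ic = Unix.in_channel_of_descr rd in
      let part : float = input_value ic in
      close_in ic;
      ignore (Unix.waitpid [] pid);
      total +. part)
    0.0 children

let () =
  Printf.printf "%.0f\n" (parallel_sum 4 (fun i -> float_of_int i) 1_000_000)
```

Since each worker runs in its own process, there is no shared runtime to pay for; the trade-off is that anything more fine-grained than one round-trip per worker gets awkward fast.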

I would recommend playing with Multicore OCaml and the shootout benchmarks; it’s a lot of fun and you learn a lot (I played with binary-trees). But I would not necessarily expect to beat our fine competitors on these benchmarks today.

7 Likes

The implementation is completely different. The F# code directly uses the C interface, while the OCaml code uses Zarith, which is a wrapper around it. In particular, the F# code gives GMP numbers an “entity” semantics, while Zarith gives them a “value” semantics. This makes a huge difference in memory usage: the OCaml code allocates like crazy, while the F# code works in place with perfect cache locality.

Compare the following two pieces of code. OCaml does (nth * num + acc), while F# does

mpzMulUi(&tmp1, &num, nth)
mpzAdd(&tmp2, &tmp1, &acc)

Sure, the OCaml code is much more readable, but the F# code controls where the numbers are stored. (More precisely, GMP controls where they are stored, but the F# code tells GMP that the buffers associated with tmp1 and tmp2 can be reused. There is no need to allocate new buffers, contrary to Zarith.)

2 Likes

Just to be sure, I did experiment a bit. With respect to last-level cache misses, there are four orders of magnitude between the Zarith wrapper and direct access to the GMP primitives: one suffers 200M cache misses, while the other only gets 17k. Another relevant data point is that one triggers 900 full compactions of the garbage collector, while the other does not trigger even a single minor collection. So there is no way the OCaml code would ever be competitive with the implementations for other languages. In fact, I am surprised that it is that fast, given the awful memory locality.

(Just to be on the safe side, I also added explicit flushes to the OCaml code, so that both programs have the same behavior with respect to I/O buffers. The results with respect to cache locality are almost the same.)
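For reference, the collection and compaction counts quoted above can be read straight off the OCaml runtime (the cache-miss numbers need an external profiler such as perf); a minimal sketch:

```ocaml
(* Print the GC counters after a run; "compactions" is the
   "full compactions" figure mentioned above. *)
let () =
  (* ... run the benchmark workload here ... *)
  let s = Gc.stat () in
  Printf.printf "minor collections: %d\nmajor collections: %d\ncompactions: %d\n"
    s.Gc.minor_collections s.Gc.major_collections s.Gc.compactions
```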

6 Likes

Regarding pidigits, this analysis suggests two directions for improvement:

  • implement fused operations (e.g. multiply-add) in Zarith (and other pure interfaces) to save on intermediate allocations; specific benchmarks could get a good boost from that
  • implement a simple low-level interface that exposes GMP’s in-place mutation operations instead of the pure interface.

My understanding is that Zarith is not designed for speed on large-number benchmarks; it is designed to do well on programs that mostly manipulate machine-size numbers but may go above that, with fast conversion to and from OCaml integers when the sizes allow (in the fast path, it’s a no-op), as a look at the implementation confirms.
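That design is visible at the API level: values that fit in a native int stay unboxed, which Obj.is_int can (unofficially) reveal. A small sketch, assuming the zarith package; the Obj checks poke at an implementation detail and are for illustration only:

```ocaml
(* Small Zarith values are plain OCaml integers: no GMP call, no heap
   allocation on the fast path. *)
let () =
  let a = Z.of_int 21 and b = Z.of_int 2 in
  assert (Z.to_int Z.(a * b) = 42);
  (* implementation detail: a small Z.t is an immediate (unboxed) value *)
  assert (Obj.is_int (Obj.repr a));
  (* a value that overflows a native int gets a boxed GMP representation *)
  let big = Z.(pow (of_int 2) 200) in
  assert (not (Obj.is_int (Obj.repr big)))
```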

My impression is that it would be a substantial amount of work to add a mutation-in-place interface to Zarith, whereas adapting simpler C bindings (without the no-op conversions to and from OCaml integers), such as gmp-ocaml, could be much easier. (One could even build the pure interface on top of the in-place interface to avoid code duplication.)

2 Likes

As far as I can tell, there already exist some shallow bindings to GMP:

Documentation: https://www.inrialpes.fr/pop-art/people/bjeannet/mlxxxidl-forge/mlgmpidl/html/Mpz.html

1 Like

Ah, so this is “just” a matter of rewriting the benchmark to use this library. Any takers?

(I thought that the Shootout website did not allow external libraries in benchmarks, but Zarith was allowed; I don’t know whether a solution using mlgmpidl would be accepted. Worth a try anyway!)

I never learned to use CamlIDL myself, but it does look quite convenient for this particular library (little code, not much boilerplate, to describe a low-level binding). I guess nowadays people would recommend Ctypes instead, which is fine as well.

Edit: in fact there is also a Ctypes binding to the low-level GMP functions (one that exposes the imperative style); it is not gmp-ocaml, which I linked to earlier, but ocaml-gmp (see the Z_raw module in gmp.ml, unfortunately not currently exposed to the outside).
It sounds like there are a few too many GMP bindings out there…
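For a sense of scale, a shallow Ctypes binding really is only a couple of lines per function. A hypothetical sketch (not taken from any of the libraries above): it treats mpz_t as an opaque pointer for simplicity, and uses GMP’s real exported symbol names, which are prefixed __gmpz_ behind the mpz_ macros in gmp.h.

```ocaml
(* Hypothetical shallow GMP binding via Ctypes; requires the
   ctypes-foreign package and linking against libgmp. *)
open Ctypes
open Foreign

(* treat mpz_t as an opaque pointer for the purposes of this sketch *)
let mpz_t = ptr void

(* void mpz_add (mpz_t rop, const mpz_t op1, const mpz_t op2) *)
let mpz_add =
  foreign "__gmpz_add" (mpz_t @-> mpz_t @-> mpz_t @-> returning void)

(* void mpz_mul_si (mpz_t rop, const mpz_t op1, long op2) *)
let mpz_mul_si =
  foreign "__gmpz_mul_si" (mpz_t @-> mpz_t @-> long @-> returning void)
```

A real binding would also wrap mpz_init/mpz_clear and attach a finalizer, but the per-function cost stays about this small.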

2 Likes

Curiously, in five tests Haskell (GHC) is significantly faster than OCaml too:

mandelbrot

source          secs    mem       gz     busy    cpu load
OCaml          14.02    5,048     717   55.89    99% 100% 100% 100%
Haskell GHC     5.06   37,660    1975   20.09    98% 100% 100% 100%

spectral-norm

source          secs    mem       gz     busy    cpu load
OCaml          15.70    3,704     377   16.86    100% 3% 5% 0%
Haskell GHC     4.09    4,184     987   16.03    98% 99% 99% 97%

fasta

source          secs    mem       gz     busy    cpu load
OCaml           6.00   199,124   1189    6.32    100% 1% 1% 4%
Haskell GHC     1.40     8,020   1882    4.16    74% 74% 74% 75%

pidigits

The Haskell implementation here uses GMP as well, though.

source          secs    mem       gz     busy    cpu load
OCaml          10.96   19,464     458   11.79    4% 1% 100% 2%
Haskell GHC     1.76    6,516    1694    1.79    0% 1% 1% 99%

regex-redux

The Haskell implementation uses the PCRE2 library, though.

source          secs    mem       gz     busy    cpu load
OCaml          25.23   901,944    637   27.09    4% 1% 100% 3%
Haskell GHC     1.68   308,212   2213    3.92    46% 77% 44% 66%

Regarding pidigits, the reason is the same as for F#. The Haskell code directly calls the GMP functions, so cache locality is perfect, contrary to OCaml/Zarith. This makes the code completely unidiomatic, since GMP functions are inherently effectful; the Haskell code thus uses seq all over the place to make sure computations are executed in the correct order.

2 Likes

These benchmark results can mostly be explained by parallel implementations in Haskell versus sequential ones in OCaml, and by differences in external libraries: GMP and PCRE2. The exception seems to be mandelbrot.

Pidigits: I rewrote OCaml solution #5 to use the low-level operations exposed by the mlgmpidl library.

The running time goes from 5.6s to 1.4s on my machine.

Diff:

-let big_3  = Z.of_int 3
-let big_4  = Z.of_int 4
-let big_10 = Z.of_int 10

-let init = (Z.one, Z.one, Z.zero)
+let init = (Mpz.of_int 1, Mpz.of_int 1, Mpz.of_int 0)

-let extract (num,den,acc) nth = Z.((nth * num + acc) / den |> to_int)
+let extract =
+  let res : Mpz.m Mpz.tt = Mpz.init () in
+  fun (num,den,acc) nth ->
+    (* (nth * num + acc) / den |> to_int *)
+    Mpz.mul_si res num nth;
+    Mpz.add res res acc;
+    Mpz.tdiv_q res res den;
+    Mpz.get_int res

-let next z = extract z big_3
+let next z = extract z 3

-let safe z n = extract z big_4 = n
+let safe z n = extract z 4 = n

-let prod (num,den,acc) d =
-  Z.(big_10 * num,
-     den,
-     big_10 * (acc - den * of_int d))
+let prod =
+  let tmp : Mpz.m Mpz.tt = Mpz.init () in
+  fun (res_num, res_den, res_acc) (num,den,acc) d ->
+    (* (10 * num, den, 10 * (acc - den * of_int d)) *)
+    Mpz.mul_si res_num num 10;
+
+    Mpz.set res_den den;
+
+    Mpz.mul_si tmp den d;
+    Mpz.sub res_acc acc tmp;
+    Mpz.mul_si res_acc res_acc 10;
+
+    ()

-let cons (num,den,acc) k =
-  let k2 = Z.of_int (k * 2 + 1) in
-  Z.(of_int k * num,
-     k2 * den,
-     k2 * (acc + num + num))
+let cons =
+  fun (res_num, res_den, res_acc) (num,den,acc) k ->
+    let k2 = k * 2 + 1 in
+    (*  (k * num, k2 * den,  k2 * (acc + num + num)) *)
+    Mpz.mul_si res_den den k2;
+
+    Mpz.add res_acc acc num;
+    Mpz.add res_acc res_acc num;
+    Mpz.mul_si res_acc res_acc k2;
+
+    (* must go below the computation of res_acc,
+       to avoid trashing the 'num' value it uses *)
+    Mpz.mul_si res_num num k;
+    
+    ()
 
 let columns = 10
 
@@ -39,12 +60,17 @@
     if col = columns then begin
       let row = row + col in
       Printf.printf "\t:%i\n%i" row d;
-      digit k (prod z d) (n-1) row 1
+      prod z z d;
+      digit k z (n-1) row 1
     end else begin 
       print_int d;
-      digit k (prod z d) (n-1) row (col+1)
+      prod z z d;
+      digit k z (n-1) row (col+1)
     end
-  else digit (k+1) (cons z k) n row col
+  else begin
+    cons z z k;
+    digit (k+1) z n row col
+  end
 
 let digits n = digit 1 init n 0 0

I will see if I can submit this on the Shootout website. I don’t know what the rules are for depending on third-party packages. (Edit: submitted as https://salsa.debian.org/benchmarksgame-team/benchmarksgame/-/issues/335.)

6 Likes

Your statement Mpz.tdiv_q res res den forces GMP to allocate fresh transient buffers rather than reusing existing long-lived ones. Notice how the C code (and all the other programs based on this algorithm) explicitly uses two buffers for this function: mpz_tdiv_q(tmp1, tmp2, den).

In extract, you can replace Mpz.mul_si res num nth; Mpz.add res res acc with Mpz.set res acc; Mpz.addmul_ui res num nth to avoid trashing the cache with an extra pass over the buffers.

The same issue exists with Mpz.mul_si tmp den d; Mpz.sub res_acc acc tmp. Unfortunately, since you have lost the information that res_acc is always acc, you cannot replace it by Mpz.submul_ui acc den d.

EDIT: To be clear, mpz_tdiv_q(tmp1, tmp2, den) needs a transient buffer too. But when written that way, GMP can avoid wasting time copying between buffers.

EDIT2: You could replace Mpz.mul_si tmp den d; Mpz.sub res_acc acc tmp with Mpz.set res_acc acc; Mpz.submul_ui res_acc den d. That would still perform a useless inplace copy, but at least that would no longer trash the cache.

1 Like

I used the following diff based on your advice:

 let extract =
+  let tmp : Mpz.m Mpz.tt = Mpz.init () in
   let res : Mpz.m Mpz.tt = Mpz.init () in
   fun (num,den,acc) nth ->
     (* (nth * num + acc) / den |> to_int *)
-    Mpz.mul_si res num nth;
-    Mpz.add res res acc;
-    Mpz.tdiv_q res res den;
+    Mpz.set tmp acc;
+    Mpz.addmul_ui tmp num nth;
+    Mpz.tdiv_q res tmp den;
     Mpz.get_int res
 
[...]

 let prod =
-  let tmp : Mpz.m Mpz.tt = Mpz.init () in
   fun (res_num, res_den, res_acc) (num,den,acc) d ->
     (* (10 * num, den, 10 * (acc - den * of_int d)) *)
     Mpz.mul_si res_num num 10;
 
-    Mpz.set res_den den;
+    if res_den != den then Mpz.set res_den den;
 
-    Mpz.mul_si tmp den d;
-    Mpz.sub res_acc acc tmp;
+    if res_acc != acc then Mpz.set res_acc acc;
+    Mpz.submul_ui res_acc den d;
     Mpz.mul_si res_acc res_acc 10;
 
     ()

Without the aliasing optimization (if res_foo != foo), the performance of the new version is essentially the same as that of my version on my machine, despite the lower instruction count. With the aliasing optimization, there is a very small improvement in performance, typically 1.35s instead of 1.4s.
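(For readers less familiar with OCaml: the aliasing guard is cheap because ( != ) is physical inequality, a single pointer comparison, with no traversal of the bignum. A tiny illustration:)

```ocaml
(* ( == ) / ( != ) compare identity; ( = ) compares structure. *)
let () =
  let a = ref 0 in
  let alias = a in
  assert (alias == a);   (* same block: physically equal *)
  let copy = ref 0 in
  assert (copy != a);    (* structurally equal, but a distinct block *)
  assert (copy = a)      (* structural equality still holds *)
```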

On the other hand, a further optimization provided a larger speedup:

 let extract =
-  let tmp : Mpz.m Mpz.tt = Mpz.init () in
-  let res : Mpz.m Mpz.tt = Mpz.init () in
+  let tmp1 : Mpz.m Mpz.tt = Mpz.init () in
+  let tmp2 : Mpz.m Mpz.tt = Mpz.init () in
   fun (num,den,acc) nth ->
     (* (nth * num + acc) / den |> to_int *)
-    Mpz.set tmp acc;
-    Mpz.addmul_ui tmp num nth;
-    Mpz.tdiv_q res tmp den;
-    Mpz.get_int res
+    Mpz.mul_si tmp1 num nth;
+    Mpz.add tmp2 acc tmp1;
+    Mpz.tdiv_q tmp1 tmp2 den;
+    Mpz.get_int tmp1

This goes from 1.35s to 1.25s.

Final version:

let extract =
  let tmp1 : Mpz.m Mpz.tt = Mpz.init () in
  let tmp2 : Mpz.m Mpz.tt = Mpz.init () in
  fun (num,den,acc) nth ->
    (* (nth * num + acc) / den |> to_int *)
    Mpz.mul_si tmp1 num nth;
    Mpz.add tmp2 acc tmp1;
    Mpz.tdiv_q tmp1 tmp2 den;
    Mpz.get_int tmp1

let prod =
  fun (res_num, res_den, res_acc) (num,den,acc) d ->
    (* (10 * num, den, 10 * (acc - den * of_int d)) *)
    Mpz.mul_si res_num num 10;

    if res_den != den then Mpz.set res_den den;

    if res_acc != acc then Mpz.set res_acc acc;
    Mpz.submul_ui res_acc den d;
    Mpz.mul_si res_acc res_acc 10;

    ()

let cons =
  fun (res_num, res_den, res_acc) (num,den,acc) k ->
    let k2 = k * 2 + 1 in
    (*  (k * num, k2 * den,  k2 * (acc + num + num)) *)
    Mpz.mul_si res_den den k2;

    Mpz.add res_acc acc num;
    Mpz.add res_acc res_acc num;
    Mpz.mul_si res_acc res_acc k2;

3 Likes

My first submission is already online on the benchmark website. On the benchmark machine it is measured at 2.88s, against 1.87s for the F# version (which seems to be doing some low-level byte-shifting stuff).

2 Likes

The number of last-level cache misses for the OCaml version is now close to zero, great. However, the number of first-level cache misses is still 50% too high. In fact, the number of memory accesses themselves is 50% too high. We can recover some performance by replacing

    Mpz.add res_acc acc num;
    Mpz.add res_acc res_acc num;

with

    if res_acc != acc then Mpz.set res_acc acc;
    Mpz.addmul_ui res_acc num 2;

But this is still nowhere near the optimal memory access patterns. I am really confused as to what might be causing this huge overhead.

Do you mean the incTotal function? I would not worry about that. I very much doubt I/O is the bottleneck here.

I finally found what was wrong. The OCaml code does not implement the correct algorithm; or rather, an important part of the algorithm is missing. The digit function should start as follows:

let rec digit k z n row col =
  if n = 0 then Printf.printf "%*s\t:%i\n%!" (columns-col) "" (row+col) else
  if let (num, _, acc) = z in Mpz.cmp num acc > 0 then begin
      cons z z k;
      digit (k + 1) z n row col
    end else
  ...

Now the OCaml version is competitive with the C version.

10 Likes