Significant performance difference between OCaml and F#

Is it worth noting that this shows how this shootout stuff is generally useless?

11 Likes

On the other hand, this thread was very interesting and educative to follow, so there may be some collateral benefits coming from those benchmark games :blush:

18 Likes

I think it’s actually really useful to document how one takes a naively-written solution a problem, and optimizes it to get it as close to the best available, as possible. Because every now and then in our code, we have such a problem we want to optimize, and knowing the techniques that have worked for others is valuable. I’d rather shamelessly plagiarize than brilliantly invent, thankyouverymuch.

I was thinking, as this thread evolved, that it’d be lovely to do this for all the problems in the shootout where ocaml lagged.

10 Likes

I got tired of thinking about memory buffers and cache misses, and decided to optimize for code readability. Behold my most beautiful version of (efficient) pidigits so far.



let extract_digit =
  let tmp1 : Mpz.m Mpz.tt = Mpz.init () in
  let tmp2 : Mpz.m Mpz.tt = Mpz.init () in
  fun (num,den,acc) nth ->
   (* joggling between tmp1 and tmp2,
      so that GMP does not use temporary buffers *)
    Mpz.mul_si tmp1 num nth;
    Mpz.add tmp2 acc tmp1;
    Mpz.tdiv_q tmp1 tmp2 den;
    Mpz.get_int tmp1 (* (nth * num + acc) / den |> to_int *)


let eliminate_digit (num, den, acc) d =
  Mpz.mul_si num num 10;   (* num <- num * 10 *)
  Mpz.submul_ui acc den d; 
  Mpz.mul_si acc acc 10 (* acc <- 10 * (acc - den * d) *)

let next_term (num, den, acc) k =
  let k2 = k * 2 + 1 in
  Mpz.mul_si den den k2; (* den <- den * k2 *)
  Mpz.addmul_ui acc num 2;
  Mpz.mul_si acc acc k2; (* acc <- (acc + 2*num) * k2 *)
  Mpz.mul_si num num k (* num <- num*k *)

let rec terms z k =
  fun () -> 
  next_term z k;
  Seq.Cons(z, terms z (k + 1))

let digit z =
  let d = extract_digit z 3 in
  if extract_digit z 4 = d
  then (eliminate_digit z d; Some d)
  else None

let all_digits =
  terms (Mpz.of_int 1, Mpz.of_int 1, Mpz.of_int 0) 1
  |> Seq.filter (fun (num, _den, acc) -> Mpz.cmp num acc <= 0)
  |> Seq.filter_map digit

let columns = 10
let rec print_digits pos n digits =
  if pos mod columns = 0 && pos > 0 then Printf.printf "\t:%i\n" pos;
  if pos = n then ()
  else match digits () with
  | Seq.Nil ->
    if pos mod columns > 0 then begin
      print_char ' ';
      print_digits (pos + 1) n digits
    end
  | Seq.Cons (d, rest) ->
    print_int d;
    print_digits (pos + 1) n rest

let () =
  let n = try int_of_string (Array.get Sys.argv 1) with _ -> 27 in  
  print_digits 0 n all_digits

(… now I have to go propose Seq.split_at for the standard library.)

6 Likes

My personal take-away is that it is not necessarily very difficult to improve the performance of OCaml solutions substantially, plus it’s actually fun and possibly useful for the language communication. And it’s a shame that we don’t have more people doing this! We analyzed the pidigits benchmarks five days ago and this suggested an easy way to get large speedups for this benchmark, yet it seems that nobody went for it before I ran with it yesterday. Of the many people showing interest in Shootout benchmarks, I wish more people were contributing.

7 Likes

That is most certainly readable, but the reason why your code actually works is awfully magical. I would not even dare to formally verify it. The code looks very beautiful, very functional, with terms iteratively producing triples of numbers, and with digits consuming them to produce the final result. But that is a blatant lie, as digits is modifying the internal state of terms under its feet. It is very hard to predict what the sequence produced by terms actually looks like. To prove that the code is correct, you would need either one of these two properties:

  1. You know the implementation details of Seq. For example, you know that it is strictly lazy, that is, it does not evaluate terms faster than its output is consumed by digits. So, you know the precise interleaving of the calls to terms and digits and you can use it in a proof.
  2. You know that digits is a left group action while terms is a right group action over the shared state. So, the actual interleaving of the calls to terms and digits does not change the final sequence, despite changing the intermediate sequences. That would be a challenging proof. (Once done, you have almost proved that the code actually computes the digits of pi.)

For me what you call “strict lazyness” is not an implementation detail but part of the specification of Seq (which I agree is very challenging to formalize). In particular, I would expect that it is possible to prove that the interleaving of next_term and eliminate_digit is the same as with a more direct imperative loop.

I agree that this is difficult to formalize, but I believe that it has been done before. For example, the Iterator module of the Mezzo standard library closes over some existentially-hidden non-duplicable state (here the state would be the ownership of the mpz triple), and each element is “borrowed” from the iterator, one has to give the permission back to get the next element. This guarantees what you call “strict lazyness” purely from a separation in the style of separation logic.

(Mezzo’s iterator are not exactly analogous to OCaml’s Seq because the consumer is in control of termination, they can call stop on the iterator to recover a postcondition, while with Seq the postcondition is only made available when the sequence is finished, which is controlled by the producer. But they are very close, and I could have implemented this as a Mezzo iterator as well – they have filter in the standard library, but lack filter_map and flat_map.)

Note: in the process of writing this version, I went through a version that had a recursive loop with exactly the structure of the simple C solution, using recursion instead of continue.

let rec digits k i n =
  if i = n then ()
  else
    let () = next_term k in
    let k = k + 1 in
    if should_skip ()
    then digits k i n
    else
      let d = extract_digit 3 in
      if extract_digit 4 <> d
      then digits k i n
      else
        let () = print_int d in
        let i = i + 1 in
        if i mod 10 = 0 then
          Printf.printf "\t:%i\n" i;
        let () = eliminate_digit d in
        digits k i n

I also wrote the following version that separates the digit computation from the printing, producing an infinite stream of digits.

let rec digits z k =
  if should_skip z then
    skip_digit z k
  else
    let d = extract_digit z 3 in
    if extract_digit z 4 = d
    then keep_digit z k d
    else skip_digit z k
and skip_digit z k =
  next_term z z k;
  digits z (k + 1)
and keep_digit z k d =
  (fun () ->
     eliminate_digit z z d;
     Seq.Cons (d, digits z k))

I would expect that one can show that the sequence of numbers produced by the two versions are identical.

3 Likes

Which is why I think that in general, outside of massive and obvious performance advantages (such as Rust vs OCaml vs Python), you’re really just getting mostly noise. Performance is proportional to the size of the respective communities, and therefore to the number of people who are available to optimize the benchmarks.

Rust, Julia, Ada…

These languages have decently large communities—perhaps larger than OCaml’s, but it’s pretty crazy that Rust is competing with C and C++, and Julia is beating C# and creeping up on Fortran. I have no doubt that this is in part attributable to better benchmark programs, but I do think it says something in general about the susceptibility to optimization of different languages—especially in cases were languages with relatively small communities compete with the, eh, “titans of industry”.

It seems to me that Rust should be competing well with C/C++. Think about the memory model of Rust: it’s explicitly designed so that you can write your code with same sort of memory-lifetime/ownership as in well-written C++, and it’s compiled (IIRC) using LLVM, so should give the same oppties for optimization as C++.

1 Like

Yes, exactly! This is why I think there is more to these benchmarks than “he who has the most users wins”. Good contributors are needed to expose the opportunities for optimization, but those opportunities have to be there in the first place.

Ideally, when LLVM evolves to account for rust’s existence, rust should
be generally faster than C++ (clang), because it offers stronger aliasing
guarantees :-). Currently afaik these optimizations are disabled because
they trigger bugs in lesser exercized code paths of LLVM.

2 Likes

That’s why I said

You’re right that it’s not just the size of the community though. If a language gets lucky and happens to have a fanatic who’s obsessed with optimizing the benchmarks for his own particular language, you’ll tend to get improvements for that language. Still just noise though.

1 Like

The OCaml pidigits benchmark seems to be broken in OCaml 4.11. E.g. see https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/ocaml-ghc.html

P.S. Thanks for your efforts on the shootout game. It’s true that these benchmarks can be quite useless to understand the intrinsic performance of the language (e.g. witness some of the very unidiomatic solutions in Haskell for instance).

However, there are a lot of eyeballs looking at these benchmarks unfortunately and people do make choices based on them…

3 Likes

Thanks, I tried to notify the benchmark maintainer at https://salsa.debian.org/benchmarksgame-team/benchmarksgame/-/issues/335#note_189205.

Edit: the issue was promptly fixed by the benchmarks maintainer, Isaac Gouy.

2 Likes

I tested regex-redux with ocaml-re out of curiosity. The duration goes from 11.5 to 7 seconds on my laptop. The difference is quite appreciable. Has anyone tried to suggest a version using Re?

2 Likes

You can see recent (<= 3 years) OCaml submissions in the issue tracker. They seem to suggest that the only OCaml submission tracked currently is “#2”, the one using Str directly. Feel free to propose a faster version using ocaml-re!

It looks like all “fast” solutions are using bindings to PCRE (either pcre or pcre2, but the “jit” version). Have you considered trying Markus Mottl’s pcre-ocaml library, which seems to support PCRE’s JIT implementation?

(The “fast” solutions are also performing the count tasks in parallel with the slower replace task. I think this could be done relatively easily by using fork after reading the input.)

I tested pcre-ocaml but couldn’t find how to use the JIT implementation. I also tested the solutions by parallelizing the execution with a fork.
Here are the results I get (in seconds) :

Impl sequential parallel
Str 12 7
Pcre 14 9
Re.Str 7 6
Re 3,4 2
5 Likes

This is very nice! Have you considered submitting the “parallel Re” version upstream?

(I tried to look at how to explicitly request “jit support” in ocaml-pcre, or even check that it is available, but I did not find how to do it; there seems to be some logic in the C stubs but no clear way to activate it from the build system. I guess @mmottl would know more.)

3 Likes

I have submitted the parallel Re version upstream however, if I misused Pcre I can test it in the correct way.

2 Likes