Is it worth noting that this shows how this shootout stuff is generally useless?
On the other hand, this thread was very interesting and educative to follow, so there may be some collateral benefits coming from those benchmark games
I think itâs actually really useful to document how one takes a naively-written solution a problem, and optimizes it to get it as close to the best available, as possible. Because every now and then in our code, we have such a problem we want to optimize, and knowing the techniques that have worked for others is valuable. Iâd rather shamelessly plagiarize than brilliantly invent, thankyouverymuch.
I was thinking, as this thread evolved, that itâd be lovely to do this for all the problems in the shootout where ocaml lagged.
I got tired of thinking about memory buffers and cache misses, and decided to optimize for code readability. Behold my most beautiful version of (efficient) pidigits so far.
let extract_digit =
let tmp1 : Mpz.m Mpz.tt = Mpz.init () in
let tmp2 : Mpz.m Mpz.tt = Mpz.init () in
fun (num,den,acc) nth ->
(* joggling between tmp1 and tmp2,
so that GMP does not use temporary buffers *)
Mpz.mul_si tmp1 num nth;
Mpz.add tmp2 acc tmp1;
Mpz.tdiv_q tmp1 tmp2 den;
Mpz.get_int tmp1 (* (nth * num + acc) / den |> to_int *)
let eliminate_digit (num, den, acc) d =
Mpz.mul_si num num 10; (* num <- num * 10 *)
Mpz.submul_ui acc den d;
Mpz.mul_si acc acc 10 (* acc <- 10 * (acc - den * d) *)
let next_term (num, den, acc) k =
let k2 = k * 2 + 1 in
Mpz.mul_si den den k2; (* den <- den * k2 *)
Mpz.addmul_ui acc num 2;
Mpz.mul_si acc acc k2; (* acc <- (acc + 2*num) * k2 *)
Mpz.mul_si num num k (* num <- num*k *)
let rec terms z k =
fun () ->
next_term z k;
Seq.Cons(z, terms z (k + 1))
let digit z =
let d = extract_digit z 3 in
if extract_digit z 4 = d
then (eliminate_digit z d; Some d)
else None
let all_digits =
terms (Mpz.of_int 1, Mpz.of_int 1, Mpz.of_int 0) 1
|> Seq.filter (fun (num, _den, acc) -> Mpz.cmp num acc <= 0)
|> Seq.filter_map digit
let columns = 10
let rec print_digits pos n digits =
if pos mod columns = 0 && pos > 0 then Printf.printf "\t:%i\n" pos;
if pos = n then ()
else match digits () with
| Seq.Nil ->
if pos mod columns > 0 then begin
print_char ' ';
print_digits (pos + 1) n digits
end
| Seq.Cons (d, rest) ->
print_int d;
print_digits (pos + 1) n rest
let () =
let n = try int_of_string (Array.get Sys.argv 1) with _ -> 27 in
print_digits 0 n all_digits
(⌠now I have to go propose Seq.split_at
for the standard library.)
My personal take-away is that it is not necessarily very difficult to improve the performance of OCaml solutions substantially, plus itâs actually fun and possibly useful for the language communication. And itâs a shame that we donât have more people doing this! We analyzed the pidigits benchmarks five days ago and this suggested an easy way to get large speedups for this benchmark, yet it seems that nobody went for it before I ran with it yesterday. Of the many people showing interest in Shootout benchmarks, I wish more people were contributing.
That is most certainly readable, but the reason why your code actually works is awfully magical. I would not even dare to formally verify it. The code looks very beautiful, very functional, with terms
iteratively producing triples of numbers, and with digits
consuming them to produce the final result. But that is a blatant lie, as digits
is modifying the internal state of terms
under its feet. It is very hard to predict what the sequence produced by terms
actually looks like. To prove that the code is correct, you would need either one of these two properties:
- You know the implementation details of
Seq
. For example, you know that it is strictly lazy, that is, it does not evaluateterms
faster than its output is consumed bydigits
. So, you know the precise interleaving of the calls toterms
anddigits
and you can use it in a proof. - You know that
digits
is a left group action whileterms
is a right group action over the shared state. So, the actual interleaving of the calls toterms
anddigits
does not change the final sequence, despite changing the intermediate sequences. That would be a challenging proof. (Once done, you have almost proved that the code actually computes the digits of pi.)
For me what you call âstrict lazynessâ is not an implementation detail but part of the specification of Seq
(which I agree is very challenging to formalize). In particular, I would expect that it is possible to prove that the interleaving of next_term
and eliminate_digit
is the same as with a more direct imperative loop.
I agree that this is difficult to formalize, but I believe that it has been done before. For example, the Iterator module of the Mezzo standard library closes over some existentially-hidden non-duplicable state (here the state would be the ownership of the mpz triple), and each element is âborrowedâ from the iterator, one has to give the permission back to get the next element. This guarantees what you call âstrict lazynessâ purely from a separation in the style of separation logic.
(Mezzoâs iterator are not exactly analogous to OCamlâs Seq
because the consumer is in control of termination, they can call stop
on the iterator to recover a postcondition, while with Seq
the postcondition is only made available when the sequence is finished, which is controlled by the producer. But they are very close, and I could have implemented this as a Mezzo iterator as well â they have filter
in the standard library, but lack filter_map
and flat_map
.)
Note: in the process of writing this version, I went through a version that had a recursive loop with exactly the structure of the simple C solution, using recursion instead of continue
.
let rec digits k i n =
if i = n then ()
else
let () = next_term k in
let k = k + 1 in
if should_skip ()
then digits k i n
else
let d = extract_digit 3 in
if extract_digit 4 <> d
then digits k i n
else
let () = print_int d in
let i = i + 1 in
if i mod 10 = 0 then
Printf.printf "\t:%i\n" i;
let () = eliminate_digit d in
digits k i n
I also wrote the following version that separates the digit computation from the printing, producing an infinite stream of digits.
let rec digits z k =
if should_skip z then
skip_digit z k
else
let d = extract_digit z 3 in
if extract_digit z 4 = d
then keep_digit z k d
else skip_digit z k
and skip_digit z k =
next_term z z k;
digits z (k + 1)
and keep_digit z k d =
(fun () ->
eliminate_digit z z d;
Seq.Cons (d, digits z k))
I would expect that one can show that the sequence of numbers produced by the two versions are identical.
Which is why I think that in general, outside of massive and obvious performance advantages (such as Rust vs OCaml vs Python), youâre really just getting mostly noise. Performance is proportional to the size of the respective communities, and therefore to the number of people who are available to optimize the benchmarks.
Rust, Julia, AdaâŚ
These languages have decently large communitiesâperhaps larger than OCamlâs, but itâs pretty crazy that Rust is competing with C and C++, and Julia is beating C# and creeping up on Fortran. I have no doubt that this is in part attributable to better benchmark programs, but I do think it says something in general about the susceptibility to optimization of different languagesâespecially in cases were languages with relatively small communities compete with the, eh, âtitans of industryâ.
It seems to me that Rust should be competing well with C/C++. Think about the memory model of Rust: itâs explicitly designed so that you can write your code with same sort of memory-lifetime/ownership as in well-written C++, and itâs compiled (IIRC) using LLVM, so should give the same oppties for optimization as C++.
Yes, exactly! This is why I think there is more to these benchmarks than âhe who has the most users winsâ. Good contributors are needed to expose the opportunities for optimization, but those opportunities have to be there in the first place.
Ideally, when LLVM evolves to account for rustâs existence, rust should
be generally faster than C++ (clang), because it offers stronger aliasing
guarantees :-). Currently afaik these optimizations are disabled because
they trigger bugs in lesser exercized code paths of LLVM.
Thatâs why I said
Youâre right that itâs not just the size of the community though. If a language gets lucky and happens to have a fanatic whoâs obsessed with optimizing the benchmarks for his own particular language, youâll tend to get improvements for that language. Still just noise though.
The OCaml pidigits benchmark seems to be broken in OCaml 4.11. E.g. see https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/ocaml-ghc.html
P.S. Thanks for your efforts on the shootout game. Itâs true that these benchmarks can be quite useless to understand the intrinsic performance of the language (e.g. witness some of the very unidiomatic solutions in Haskell for instance).
However, there are a lot of eyeballs looking at these benchmarks unfortunately and people do make choices based on themâŚ
Thanks, I tried to notify the benchmark maintainer at https://salsa.debian.org/benchmarksgame-team/benchmarksgame/-/issues/335#note_189205.
Edit: the issue was promptly fixed by the benchmarks maintainer, Isaac Gouy.
I tested regex-redux with ocaml-re out of curiosity. The duration goes from 11.5 to 7 seconds on my laptop. The difference is quite appreciable. Has anyone tried to suggest a version using Re?
You can see recent (<= 3 years) OCaml submissions in the issue tracker. They seem to suggest that the only OCaml submission tracked currently is â#2â, the one using Str
directly. Feel free to propose a faster version using ocaml-re
!
It looks like all âfastâ solutions are using bindings to PCRE (either pcre or pcre2, but the âjitâ version). Have you considered trying Markus Mottlâs pcre-ocaml library, which seems to support PCREâs JIT implementation?
(The âfastâ solutions are also performing the count tasks in parallel with the slower replace task. I think this could be done relatively easily by using fork
after reading the input.)
I tested pcre-ocaml but couldnât find how to use the JIT implementation. I also tested the solutions by parallelizing the execution with a fork.
Here are the results I get (in seconds) :
Impl | sequential | parallel |
---|---|---|
Str | 12 | 7 |
Pcre | 14 | 9 |
Re.Str | 7 | 6 |
Re | 3,4 | 2 |
This is very nice! Have you considered submitting the âparallel Reâ version upstream?
(I tried to look at how to explicitly request âjit supportâ in ocaml-pcre, or even check that it is available, but I did not find how to do it; there seems to be some logic in the C stubs but no clear way to activate it from the build system. I guess @mmottl would know more.)
I have submitted the parallel Re version upstream however, if I misused Pcre I can test it in the correct way.