[mystery solved] Compiling with continuations: flambda and performance

Aha. Needed a little more care with the placing of abstractions for continuations.

Erasing post, b/c now it all works as desired.

Restoring the deleted post:

I was recently chewing over monadic concurrency, and got to wondering what the performance difference is between writing in direct style and writing in the -simplest- kind of monadic style I could imagine: just continuations in tail position.

So I wrote a benchmark: https://github.com/chetmurthy/event/blob/master/tests/test_ab.ml

It's a simple test with two modes, "direct" and "kont". In both modes it loops through a buffer, fetching a character and bitwise-or-ing it into an accumulator (to simulate -using- the fetched byte, and hopefully forcing the optimizer and hardware to wait for that fetch). I'm using the 4.11.1+flambda compiler with "-O3 -unbox-closures", so hopefully I'm getting excellent inlining and allocation-avoidance.

And yet, direct-style is 5x faster than CPS on this (admittedly micro)benchmark. I wonder why … It’s been decades since I went down into the assembler … does anybody have any ideas on what might be going wrong here?

./test_ab.opt direct 1024000
258MiB/sec: 1024000 read in 0.003792 secs

./test_ab.opt kont 1024000
50MiB/sec: 1024000 read in 0.019679 secs

P.S. I guess, for such a simple benchmark, I was hoping that the compiler could inline enough to discover the loop. Ah, well.
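For readers who don't want to click through, here is a minimal hypothetical reconstruction of the two modes as described above (names and details are my guesses; the real code is in the linked test_ab.ml):

```ocaml
(* Hypothetical sketch of the benchmark's two modes, reconstructed from the
   description above; not the actual test_ab.ml code. *)
let buffer = Bytes.make 1024 'x'

(* "direct" mode: a plain tail-recursive loop *)
let direct n =
  let rec loop i acc =
    if i >= n then acc
    else loop (i + 1) (acc lor Char.code (Bytes.get buffer (i mod 1024)))
  in
  loop 0 0

(* "kont" mode: each step receives its continuation in tail position *)
let kont n =
  let rec loop i acc k =
    if i >= n then k acc
    else loop (i + 1) (acc lor Char.code (Bytes.get buffer (i mod 1024))) k
  in
  loop 0 0 (fun acc -> acc)
```

Both compute the same accumulator; the question is whether flambda can eliminate the closure traffic in the second version.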

In my experience, even for algorithmic code that is more naturally expressed imperatively (say, with a lot of integer arrays), loops beat direct tail-recursive functions, even without any CPS encoding. I.e., while … or for … with local references tends to be faster than let rec loop …. That makes me a bit sad, because OCaml doesn't offer a lot of imperative control flow, so the code tends to be uglier this way.
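To make the comparison concrete, here is a small illustrative example of the two styles being contrasted (my own example, not from the thread): summing an int array with a for-loop over a local reference versus a tail-recursive helper.

```ocaml
(* Imperative style: for-loop with a local reference. *)
let sum_loop (a : int array) =
  let acc = ref 0 in
  for i = 0 to Array.length a - 1 do
    acc := !acc + a.(i)
  done;
  !acc

(* Functional style: direct tail-recursive loop over the same array. *)
let sum_rec (a : int array) =
  let n = Array.length a in
  let rec go i acc =
    if i >= n then acc else go (i + 1) (acc + a.(i))
  in
  go 0 0
```

Both compute the same result; the claim above is that, in practice, the first form often compiles to faster code.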

Could you please recover the original post? I remember that it described flambda's optimisation of a (continuation?) monad. I am also interested in optimising function monads like this (e.g. a state monad for linearity checking, https://github.com/keigoi/linocaml ). Thanks!


Argh, I’m sorry, I didn’t save a copy, and I don’t know if this discussion board software has a way to recover deleted text. But in any case, the benchmark I wrote is here: https://github.com/chetmurthy/event/blob/master/tests/test_ab.ml

I'm experimenting, trying to figure out the most efficient form of monadic I/O for this simplest example, before moving on to more complex things.

Recovered it from my emails and restored it in your original post.

For your information: your CPS code is bad because it allocates closures all the time. To prevent this, you should eta-expand your definitions:

(* bad: [read1 n k] first computes [read1 n], then applies the result to [k] *)
let read1 n : int Kont.comp =
  Kont.return (Char.code (Bytes.get buffer (n mod 1024)))

(* good: [Kont.return] can be inlined, and [read1 n k] is a single direct application *)
let read1 n k =
  Kont.return (Char.code (Bytes.get buffer (n mod 1024))) k

This is particularly important for the recursive readn, which can be specialised for a given continuation if eta-expanded but not otherwise.
Small experiments on my side show read speeds of around 1 GB/s in both direct and CPS styles (with flambda, and with some patches to speed up the loop bodies themselves).


Yes, this is what I figured out. AKA: "writing in CPS form is hard to get right". grin

In the upper right-hand corner of the post, by the timestamp, there’s a little orange pencil icon showing that it was edited. If you click that, you can step through the edit history to see previous versions.