[mystery solved] Compiling with continuations: flambda and performance

Aha. Needed a little more care with the placing of abstractions for continuations.

Erasing post, b/c now it all works as desired.

Restoring the deleted post:

I was recently chewing over monadic concurrency, and got to wondering what the performance difference is between writing in direct style and writing in the -simplest- kind of monadic style I could imagine: just continuations in tail position.

So I wrote a benchmark: https://github.com/chetmurthy/event/blob/master/tests/test_ab.ml

It's a simple test with two modes, "direct" and "kont". In both modes it loops through a buffer, fetching a character and bitwise-or-ing it into an accumulator (to simulate -using- the fetched byte, and hopefully forcing the optimizer and hardware to wait for that fetch). I'm using the 4.11.1+flambda compiler with "-O3 -unbox-closures", so hopefully I'm getting excellent inlining and allocation-avoidance.

And yet, direct-style is 5x faster than CPS on this (admittedly micro)benchmark. I wonder why … It’s been decades since I went down into the assembler … does anybody have any ideas on what might be going wrong here?

./test_ab.opt direct 1024000
258MiB/sec: 1024000 read in 0.003792 secs

./test_ab.opt kont 1024000
50MiB/sec: 1024000 read in 0.019679 secs

P.S. I guess, for such a simple benchmark, I was hoping that the compiler could inline enough to discover the loop. Ah, well.
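For readers who don't want to click through, here is a minimal hypothetical reconstruction of the two modes as described above (names and details are my guesses; the real code is in the linked test_ab.ml):

```ocaml
(* Hypothetical sketch of the benchmark's two modes, reconstructed from the
   description above; not the actual test_ab.ml code. *)
let buffer = Bytes.make 1024 'x'

(* "direct" mode: a plain tail-recursive loop *)
let direct n =
  let rec loop i acc =
    if i >= n then acc
    else loop (i + 1) (acc lor Char.code (Bytes.get buffer (i mod 1024)))
  in
  loop 0 0

(* "kont" mode: each step receives its continuation in tail position *)
let kont n =
  let rec loop i acc k =
    if i >= n then k acc
    else loop (i + 1) (acc lor Char.code (Bytes.get buffer (i mod 1024))) k
  in
  loop 0 0 (fun acc -> acc)
```

Both compute the same accumulator; the question is whether flambda can eliminate the closure traffic in the second version.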

In my experience, even for algorithmic code that is more naturally expressed imperatively (say, with a lot of integer arrays), loops beat direct tail-recursive functions, even without any CPS encoding. I.e., while … or for … with local references tends to be faster than let rec loop …. That makes me a bit sad, because OCaml doesn't offer a lot of imperative control flow, so the code tends to be uglier this way.
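To make the comparison concrete, here is a small illustrative example of the two styles being contrasted (my own example, not from the thread): summing an int array with a for-loop over a local reference versus a tail-recursive helper.

```ocaml
(* Imperative style: for-loop with a local reference. *)
let sum_loop (a : int array) =
  let acc = ref 0 in
  for i = 0 to Array.length a - 1 do
    acc := !acc + a.(i)
  done;
  !acc

(* Functional style: direct tail-recursive loop over the same array. *)
let sum_rec (a : int array) =
  let n = Array.length a in
  let rec go i acc =
    if i >= n then acc else go (i + 1) (acc + a.(i))
  in
  go 0 0
```

Both compute the same result; the claim above is that, in practice, the first form often compiles to faster code.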

Could you please recover the original post? I remember that it described flambda's optimisation of a (continuation?) monad. I am also interested in optimising function monads like this (e.g. a state monad for linearity checking, https://github.com/keigoi/linocaml ). Thanks!


Argh, I’m sorry, I didn’t save a copy, and I don’t know if this discussion board software has a way to recover deleted text. But in any case, the benchmark I wrote is here: https://github.com/chetmurthy/event/blob/master/tests/test_ab.ml

I'm experimenting, trying to figure out the most efficient form of monadic I/O for this simplest example, before moving on to more complex things.

Recovered it from my emails and restored it in your original post.

For your information: your CPS code is bad because it allocates closures all the time. To prevent this, you should eta-expand your definitions:

(* bad: [read1 n k] first computes [read1 n], then applies the result to [k] *)
let read1 n : int Kont.comp =
  Kont.return (Char.code (Bytes.get buffer (n mod 1024)))

(* good: [Kont.return] can be inlined, and [read1 n k] is a single direct application *)
let read1 n k =
  Kont.return (Char.code (Bytes.get buffer (n mod 1024))) k

This is particularly important for the recursive readn, which can be specialised for a given continuation if eta-expanded but not otherwise.
Small experiments on my side show read speeds of around 1 GB/s in both direct and CPS styles (with flambda, and with some patches to speed up the loop bodies themselves).


Yes, this is what I figured out. AKA: "writing in CPS form is hard to get right". grin

In the upper right-hand corner of the post, by the timestamp, there’s a little orange pencil icon showing that it was edited. If you click that, you can step through the edit history to see previous versions.