Performance of the new effects in comparison to monads

My reading of the effects paper is that, in addition to improving programmability, effects should be strictly equal to or higher performance than the equivalent monadic code (modulo things like flambda), primarily because they can avoid allocating closures etc.

My microbenchmark is at: prototyping-testing/main.ml at 37f5d35ea6f644339bfb6426e47d6b4b1de0b93b · Cjen1/prototyping-testing · GitHub

My use case is a state machine for a distributed system where on each iteration it receives a list of incoming messages and emits a list of messages to send.

In the benchmark I am testing just the emitting of the messages to send, by looping to send all the 'messages' (integers here, from n → 0). There is also a (minimal?) variation which does the same using a reference.

I suspect that what is happening is a tradeoff between unwinding the stack (for the effect system) and the allocations (for the monad), but it is also highly likely that I have missed something obvious!

(Also, just to point out: the effects system not being quite as fast as, for example, the monadic approach matters little when you consider that you get a very good way of writing generic effectful code without having to deal with coloured functions.)

("Starting test" (n 10))
Estimated testing time 40s (4 benchmarks x 10s). Change using '-quota'.
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Name   β”‚ Time/Run β”‚ mWd/Run β”‚ Percentage β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ EffFun β”‚ 353.02ns β”‚ 225.00w β”‚    100.00% β”‚
β”‚ EffRef β”‚ 199.26ns β”‚ 178.01w β”‚     56.45% β”‚
β”‚ Monads β”‚  91.73ns β”‚ 234.00w β”‚     25.98% β”‚
β”‚ SimRef β”‚  34.06ns β”‚  36.00w β”‚      9.65% β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

("Starting test" (n 1000))
Estimated testing time 40s (4 benchmarks x 10s). Change using '-quota'.
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Name   β”‚ Time/Run β”‚ mWd/Run β”‚ mjWd/Run β”‚ Prom/Run β”‚ Percentage β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ EffFun β”‚  30.64us β”‚ 20.03kw β”‚  116.31w β”‚  116.31w β”‚    100.00% β”‚
β”‚ EffRef β”‚  16.37us β”‚ 15.03kw β”‚  173.63w β”‚  173.63w β”‚     53.43% β”‚
β”‚ Monads β”‚  10.10us β”‚ 22.01kw β”‚  126.35w β”‚  126.35w β”‚     32.97% β”‚
β”‚ SimRef β”‚   3.72us β”‚  3.01kw β”‚   34.39w β”‚   34.39w β”‚     12.13% β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

("Starting test" (n 10000))
Estimated testing time 40s (4 benchmarks x 10s). Change using '-quota'.
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Name   β”‚ Time/Run β”‚  mWd/Run β”‚ mjWd/Run β”‚ Prom/Run β”‚ Percentage β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ EffFun β”‚ 415.01us β”‚ 200.03kw β”‚  11.47kw β”‚  11.47kw β”‚    100.00% β”‚
β”‚ EffRef β”‚ 333.82us β”‚ 150.04kw β”‚  17.19kw β”‚  17.19kw β”‚     80.44% β”‚
β”‚ Monads β”‚ 230.45us β”‚ 220.02kw β”‚  12.59kw β”‚  12.59kw β”‚     55.53% β”‚
β”‚ SimRef β”‚  70.04us β”‚  30.01kw β”‚   3.43kw β”‚   3.43kw β”‚     16.88% β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

With effects I think you only need to pay the "cost" of the effect when the effect is actually triggered (e.g. on a syscall, although I assume for network syscalls you could also perform them optimistically and rely on EAGAIN to trigger the effect; I haven't checked how Eio is implemented).

With monads you have to pay that cost all the time, because each computation and function return type needs to be wrapped in the monad, which can result in the allocation of a lot of closures, especially when a value is waiting on the return of a syscall (some more short-lived than others; although Lwt has some optimizations for when the result is immediately known, I think it can still build up chains of deferred callbacks when the Lwt.t result is not immediately available).
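To make the closure-allocation point concrete, here is a minimal sketch (hypothetical toy code, not the benchmark's actual implementation; the names `M`, `Emit`, `monadic`, `effectful` are my own, and the Effect module requires OCaml 5.x): a writer-style monad allocates a closure and a pair on every bind, while the direct-style effect version only runs the handler when `Emit` is performed.

```ocaml
(* Monadic version: each [bind] wraps the continuation in a closure
   and allocates an intermediate result pair. *)
module M = struct
  type 'a t = 'a * int list            (* value plus emitted messages *)
  let return x = (x, [])
  let emit m = ((), [m])
  let bind (x, ms) f = let (y, ms') = f x in (y, ms @ ms')
end

let monadic n =
  let rec go i acc =
    if i = 0 then acc
    else go (i - 1) (M.bind acc (fun () -> M.emit i))
  in
  snd (go n (M.return ()))

(* Effect version: the emitting code is direct-style; the handler
   collects messages and resumes the continuation each time. *)
open Effect
open Effect.Deep

type _ Effect.t += Emit : int -> unit Effect.t

let effectful n =
  let msgs = ref [] in
  match_with
    (fun () -> for i = n downto 1 do perform (Emit i) done)
    ()
    { retc = (fun () -> List.rev !msgs);
      exnc = raise;
      effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Emit m -> Some (fun (k : (a, _) continuation) ->
            msgs := m :: !msgs; continue k ())
        | _ -> None) }

(* Both emit the integers n → 1. *)
let () = assert (monadic 3 = [3; 2; 1])
let () = assert (effectful 3 = monadic 3)
```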

So a more realistic comparison of concurrency monads vs effects would measure both paths:

  • immediate value propagation, where the monad, or the compiler via flambda, can optimize away some of the overhead

  • suspended/blocked values, which might trigger the construction of a lot of closures to be run when the value becomes available (probably best to measure the Lwt and Async implementations here directly, because they'll be a lot more complicated than simple ones, especially since they might also have to deal with cancellation, etc., and they may have some optimizations of their own, e.g. I think Lwt uses mutation internally to avoid some pathological cases). This depends on the depth of the call stack (so e.g. try 10-30 binds after the "sleeping" monad value), but if you happen to fold_left bind over a list the chains can be much, much longer.
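As an illustration of that suspended path (a hypothetical toy promise of my own, not Lwt's actual implementation): every bind on a still-pending value allocates a callback closure, so a fold_left of bind over a list roots a long chain of callbacks at the single pending promise, all of which must fire when it resolves.

```ocaml
(* Toy promise: a pending promise carries a list of callbacks. *)
type 'a state = Resolved of 'a | Pending of ('a -> unit) list ref
type 'a promise = 'a state ref

let pending () : 'a promise = ref (Pending (ref []))
let return x : 'a promise = ref (Resolved x)

let on_resolve p f =
  match !p with
  | Resolved x -> f x
  | Pending cbs -> cbs := f :: !cbs    (* allocate and store a callback *)

let resolve p x =
  match !p with
  | Resolved _ -> invalid_arg "already resolved"
  | Pending cbs ->
      p := Resolved x;
      List.iter (fun f -> f x) (List.rev !cbs)

(* [bind] on a pending promise allocates two closures per link. *)
let bind p f =
  let r = pending () in
  on_resolve p (fun x -> on_resolve (f x) (fun y -> resolve r y));
  r

(* fold_left bind over a list builds the whole chain up front, rooted
   at the one pending promise; resolving it runs every callback. *)
let demo () =
  let start = pending () in
  let chained =
    List.fold_left (fun acc i -> bind acc (fun x -> return (x + i)))
      start [1; 2; 3]
  in
  resolve start 0;
  match !chained with Resolved v -> v | Pending _ -> -1

let () = assert (demo () = 6)
```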

I think effects can avoid the latter problem altogether; see Reflection without Remorse for a description of the problem (it affects Haskell more because Haskell doesn't have mutation and has to resort to other data structures to recover the performance lost on long monad chains; I think Lwt achieves a similar optimization through the use of mutation, but a naively implemented monad would pay the full cost).


Sorry if this seems out of place, but what's this Caml.Effect module you open? I don't see it in the standard library, nor can I find any reference to it on the web.

They (probably) use the Jane Street libraries, which re-export the stdlib as Caml.

Yep using Core requires the reexport (afaik).

Granted, but I don't see an Effect module in the stdlib reference at https://v2.ocaml.org/manual/stdlib.html

Yep it gets added in OCaml 5.0.
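For anyone curious, here is a minimal use of the new stdlib Effect module (a sketch assuming OCaml 5.x; the `Get` effect and `run` function are made up for illustration — with Core open, the stdlib is reachable as Caml, hence Caml.Effect in the linked benchmark):

```ocaml
(* A "Get" effect whose handler answers every request with 21. *)
open Effect
open Effect.Deep

type _ Effect.t += Get : int Effect.t

let run () =
  try_with
    (fun () -> perform Get + perform Get)   (* direct-style client code *)
    ()
    { effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Get -> Some (fun (k : (a, _) continuation) -> continue k 21)
        | _ -> None) }

let () = assert (run () = 42)
```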


Oh sorry, I keep forgetting that 5.0 isn't the official release yet. :wink: (I'm still on 4.13.1 but I have my eye on 5.x for Eio & friends. Didn't think effects would be in that release; I thought they were for much later. I must be confusing "effects" vs "typed effects".)

Indeed, there hasn't been much (visibly) going on for effect typing in the compiler, but the runtime mechanism for effect handlers (delimited-control support, new garbage collector, etc.) has been in the works for the last few years and has successfully landed in 5.0.0.

Delimited Continuations vs Lwt for Threads | MirageOS seems like a relevant article.


That post is quite old and focuses on user-level delimited continuations from the (honestly excellent) delimcc library. Effects' delimited continuations are first-class in the runtime and limited to one-shot (linear) usage… I imagine that gives them different performance characteristics?

There was a blog post where the authors of angstrom observed performance improvements when they switched from Lwt to effects; let me see if I can find it…