Blog post: A new general computation model for Dune

Tags: build, dune

#1

A blog post about the new computation model that Dune is built on.


#2

Why is the memoize call explicit? One could naively think that any 'b Fiber.t represents an incremental 'b value that is cached/memoized (… Fiber.t would be a parallel+incremental effect type). I suppose you get a benefit from having the programmer control where the memoization points are placed?
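For concreteness, here is a minimal sketch of what I mean by an explicit memoization point provided as a plain library; the names are hypothetical and presumably not Dune's actual API:

(* [memoize] wraps a function in a cache; the programmer decides which
   functions become memoization points by calling it explicitly. *)
let memoize (type a) (module H : Hashtbl.HashedType with type t = a) f =
  let module Tbl = Hashtbl.Make (H) in
  let cache = Tbl.create 16 in
  fun x ->
    match Tbl.find_opt cache x with
    | Some y -> y
    | None ->
      let y = f x in
      Tbl.add cache x y;
      y

(* Usage: only [slow_double] is a memoization point; everything else
   is recomputed freely. *)
let slow_double =
  memoize
    (module struct
      type t = int
      let equal = Int.equal
      let hash = Hashtbl.hash
    end)
    (fun n -> n * 2)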


#3

My understanding is that you need to give the input/output ops to specify explicitly how to hash/serialize your values, so that memoization works across calls to dune.
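Roughly this shape, I imagine (illustrative names, not necessarily Dune's actual interface):

module type Input = sig
  type t
  val hash : t -> int             (* for the in-memory memo table *)
  val to_string : t -> string     (* serialize for the on-disk cache *)
  val of_string : string -> t     (* deserialize on a later run *)
end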


#4

Very cool! Do you think such a memoization system might have uses other than inside a build system?

Also I see you mentioned Build Systems à la Carte; where would Dune be in the scheduling algorithm/rebuilding strategy table?


#5

@gasche, there are several reasons. The first reason is that memoization is not free and adds a bit of overhead. In particular, you need to use more memory in order to track the dependencies and cached results, and you also spend time checking whether previous results are still valid. If you memoize everything, the overhead of the memoization system is likely to become too big and slow everything down.

By controlling explicitly what is memoized, we have better control over the trade-off between the overhead of memoization and the amount of work that is recomputed when certain things change. In particular, we just need to make sure that the amount of recomputation needed for common modifications is small enough not to be noticeable. It is not necessary to go any further than that.

The second reason is indeed so that we can make cached results persist between calls to dune, as @c-cube mentioned. We also have operations to parse/print the inputs, which helps with debugging.

The third reason is a bit less obvious. In order to recompute as little as possible, it is important to consider as few dependencies as possible. If we allowed memoization at arbitrary points, we might capture computed values inside closures, which means that the memoization system has no choice but to consider all the transitive dependencies. In systems such as Skip, where memoization is part of the language, it is probably possible to be clever and detect what is being captured; however, we can't do that with a normal library. To avoid this problem in Dune, we enforce that only top-level functions can be memoized, i.e. once the build starts it is not possible to memoize any new function. Programmers have to manually eta-expand closures they want to memoize, explicitly specifying the inputs of the function to the system at the same time.
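As an illustration of that eta-expansion discipline (the names below are made up for the example, not actual Dune functions):

(* Illustrative stubs. *)
type config = { flags : string list }
let compile ~config target = String.concat " " (config.flags @ [target])

(* Problematic shape: the closure captures [config], so a memoization
   system wrapping it would have to treat all of [config]'s transitive
   dependencies as dependencies of every call. *)
let rule_of config = fun target -> compile ~config target

(* Eta-expanded, top-level shape: [config] is part of the explicit
   input, so the system knows exactly what each result depends on and
   can hash/serialize it. *)
let compile_rule (config, target) = compile ~config target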


#6

@steinuil, I’ll need to re-read the paper, but I believe Dune is currently a “suspending with verifying traces” build system. However, we started experimenting with a shared artifact cache for Dune, which would make it a “suspending with constructive traces” build system. The initial results on this front are very promising, BTW.
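For readers without the paper at hand: a verifying trace records enough hashes to check that a result we already have locally is still up to date, while a constructive trace also stores the result itself, so a matching entry can be fetched from a (possibly shared) cache instead of rebuilt. A rough type-level sketch, with illustrative field names:

type hash = int

(* Verifying trace: can only confirm that the locally stored result
   is still valid. *)
type verifying = {
  dep_hashes : hash list;   (* hashes of the inputs used last time *)
  output_hash : hash;       (* hash of the output produced last time *)
}

(* Constructive trace: also carries the output, so it can be reused
   directly, e.g. from a shared artifact cache. *)
type 'a constructive = {
  deps : hash list;
  output : 'a;
}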


#7

I suppose so. I don’t have specific applications in mind, but it is quite a general system.


#8

If you make the memoization implicit in the Fiber.t monad you also break the monad laws in a meaningful way. For example,

let a = ... in
let b = map a f in
let c = map b g in
...

is not equivalent to:

let a = ... in
let c = map a (fun x -> g (f x)) in
...

because in the first the call to f x is memoized and in the second it is not.
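One way to make the difference observable in plain OCaml is with two consumers, using Lazy.t as a crude stand-in for a node that memoizes its result (a toy, not Fiber.t):

let f_calls = ref 0
let f x = incr f_calls; x * 2
let g x = x + 1
let h x = x - 1

(* Unfused: [b] is a shared memoization point, so [f] runs once even
   with two consumers. *)
let a = lazy 21
let b = lazy (f (Lazy.force a))
let c = lazy (g (Lazy.force b))
let d = lazy (h (Lazy.force b))
let () =
  ignore (Lazy.force c);
  ignore (Lazy.force d);
  assert (!f_calls = 1)

(* Fused: there is no intermediate node, so [f] runs once per
   consumer. *)
let c' = lazy (g (f (Lazy.force a)))
let d' = lazy (h (f (Lazy.force a)))
let () =
  f_calls := 0;
  ignore (Lazy.force c');
  ignore (Lazy.force d');
  assert (!f_calls = 2)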

This prevents a number of important optimisations. For example, you can no longer treat:

map (both a (both b (both c (both d e)))) f

as:

map5 a b c d e f

since they would have different memoization properties. This is essentially the same reason why using let%bind/let* syntax is not optimal for async.

In practice, between the loss of optimisations and the tendency to produce much more fine-grained memoization than is efficient, this style of monad has noticeably worse performance than one where the memoization is explicit.


#9

@lpw25: isn’t this a common consequence of having non-tracked (unsafe) effects in monadic programs? As soon as your monad may duplicate, erase or even reorder computations, you can observe law-breaking behaviors if the programs building monadic values are not pure.

In this case it is not even a monad law that is being broken, but functoriality. It is well understood that, for example,

List.map f >> List.map g
List.map (f >> g)

are not equivalent in OCaml for arbitrary f and g, but I’m not sure this is an argument against using lists to represent non-determinism.

I’m not trying to contradict what you say, since you obviously have experience with incremental-computation monads, but I’m genuinely curious. Why is the interaction between untracked effects and memoization a “meaningful” issue that prevents optimizations?


#10

I always struggle to articulate the issue here, and I don’t think I chose a good example of it.

I think there are a few reasons that this case is worse than the list case:

a) Memoization doesn’t just change the behaviour of the untracked effects; it also often changes the behaviour of the effects that are being tracked by the monad itself. For example, if foo and bar are two of the tracked effects:

foo >>= fun x ->
bar (x mod 5)

and

foo >>= fun x ->
return (x mod 5) >>= fun y ->
bar y

can disagree about whether the bar effect gets re-executed when the value of x has changed but the value of x mod 5 has remained the same.
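A toy version of that cutoff (not Dune's machinery): the intermediate memoization point keeps its previous output and only re-executes bar when that output actually changed.

let bar_runs = ref 0
let bar y = incr bar_runs; y

(* Cached output of the intermediate [x mod 5] node from the previous
   run. *)
let previous = ref None

let step x =
  let y = x mod 5 in
  match !previous with
  | Some y' when y' = y -> ()   (* cutoff: [bar] is not re-executed *)
  | _ -> previous := Some y; ignore (bar y)

let () =
  step 7;    (* 7 mod 5 = 2, so [bar] runs *)
  step 12;   (* x changed, but 12 mod 5 = 2, so [bar] is skipped *)
  assert (!bar_runs = 1)

Without the intermediate return (x mod 5) node there is no previous value of x mod 5 to compare against, so bar re-runs whenever x changes.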

b) Unlike the list case, here it is the monadic primitives themselves that are performing untracked side effects. Even in a case like async, where there is sharing but no cut-offs, you get that:

let x = read_file () in
x >>= fun file1 ->
x >>= fun file2 ->
return (file1, file2)

is different from:

read_file () >>= fun file1 ->
read_file () >>= fun file2 ->
return (file1, file2)

Essentially, you can no longer think of the monad as being a pure representation of a computation that is then executed.

c) The reason that we have sharing or memoization in these monads is that we care about those things. Since it is a side-effect that we care about and wish to reason about, the monad should be tracking it just like the other effects in the monad. When you make this explicit it becomes much clearer that we are drastically breaking the monad laws, since our structural operations (map, bind, etc.) are performing effects.

d) Monads with sharing do not seem to be naturally modeled using algebraic effects, which implies that they are in some way non-algebraic. I’ve not got a clear handle on this issue but it definitely bothers me.

Separately from that I would also say that I do consider the issues with list as an argument for not using list directly to represent non-determinism. Were I making a library for non-determinism I would probably lean towards something like unit -> 'a list or a free monad instead.
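For concreteness, a minimal version of that thunked representation might look like this (a sketch, not an existing library):

(* Non-determinism as a delayed list: effects in the producing code
   run only when the computation is forced, which recovers the
   equational reasoning lost with bare lists. *)
type 'a nondet = unit -> 'a list

let return (x : 'a) : 'a nondet = fun () -> [x]

let bind (m : 'a nondet) (f : 'a -> 'b nondet) : 'b nondet =
  fun () -> List.concat_map (fun x -> f x ()) (m ())

let run (m : 'a nondet) = m ()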

I feel like these arguments point towards some deeper issue that I don’t have a clean explanation for, but they are enough to convince me that it is not the right approach. Especially combined with the improved performance and reasoning you can get by making memoization/sharing an explicitly tracked effect within the monad.