My mental model says “probably yes?”, though I worry I’ll be baking in some really difficult-to-track-down bugs, so I thought it worth a chat here.
Obviously you can’t sprinkle magic multicore pixie dust onto Lwt apps and make them automatically parallel, because Lwt applications lean heavily on not being task-switched except at Lwt bind points. Though I assume I can at least add some parallelism to my Lwt applications like so:
let run_on_separate_core f =
  Lwt_preemptive.detach
    (fun () ->
      let d = Domain.spawn (fun _ -> f ()) in
      Domain.join d)
    ()
let other_thing () =
  (* ... *)
  try%lwt
    let%lwt res = run_on_separate_core my_task in
    Lwt_io.printf "exited with result: %d\n" res
  with
  | e -> failwithf !"my_task threw exception: %{Exn}" e ()
  (* ... *)
This will start a POSIX thread separate from Lwt, which normally doesn’t get you much compute parallelism because of the OCaml runtime lock, but in this case we transfer f into its own domain and that will run in parallel. When it’s done, the result transfers back into the first domain and then gets bundled up into a fulfilled Lwt promise.
The calling thread can learn about the outcome through a normal Lwt.t promise (fulfilled or rejected with an exception).
Yes, it seems that this pattern will work correctly for the normal case – Neat.
However, what will happen if a user tries to cancel the promise returned by Lwt_preemptive.detach? Cancellation will prevent any promises chained onto the one returned from detach from running, which is great, but the domain thread will keep running. We want the domain thread to be stopped somehow.
Looking at the Lwt manual (see “Cancellable Promises”), you might need to attach an on_cancel handler to the promise returned by detach and somehow interrupt the domain d. I don’t know how interrupting domains works in Multicore OCaml, though…
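As far as I know there is no way to kill a domain from the outside, so one workaround is cooperative cancellation: the task polls a flag that the on_cancel handler sets. Here is a minimal sketch of that idea; the names (run_cancellable_on_core, stop, interruptible_task) are mine, not an Lwt API, and whether detach’s promise is even cancelable is still the open question above:

```ocaml
(* Cooperative cancellation sketch: the domain's task must poll a flag,
   because there is no API to interrupt a running domain from outside. *)
let run_cancellable_on_core task =
  let stop = Atomic.make false in
  let p =
    Lwt_preemptive.detach
      (fun () ->
        let d = Domain.spawn (fun _ -> task ~stop) in
        Domain.join d)
      ()
  in
  (* If the promise is cancelled, ask the domain to wind down. *)
  Lwt.on_cancel p (fun () -> Atomic.set stop true);
  p

(* A task written to check the flag periodically: *)
let interruptible_task ~stop =
  let i = ref 0 in
  while (not (Atomic.get stop)) && !i < 1_000_000 do incr i done;
  !i
```

The obvious cost is that every offloaded task has to be written with polling in mind; a tight numeric loop that never checks the flag will still run to completion.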
P.S. Another issue is that detach draws from a fixed pool of preemptive threads – some low fixed number by default. These preemptive threads will be quickly exhausted if you start many jobs in the above manner. We should be able to Domain.spawn a large number of times, but it seems to me we will be limited in this case by the Lwt preemptive pool. Lwt experts – any opinions?
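The pool size is at least adjustable: Lwt_preemptive exposes get_bounds/set_bounds for the minimum and maximum number of preemptive threads. A sketch (the value 64 is arbitrary, pick whatever fits your workload):

```ocaml
(* Enlarge the preemptive pool before detaching many jobs at once;
   otherwise detach calls queue up waiting for a free pool thread. *)
let () =
  let min_threads, _max_threads = Lwt_preemptive.get_bounds () in
  (* keep the default minimum, but allow up to 64 threads in flight *)
  Lwt_preemptive.set_bounds (min_threads, 64)
```

That raises the ceiling, but it doesn’t change the underlying design: each detached job still occupies one pool thread for its whole duration while the spawned domain runs.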
We could try drilling down into the POSIX thread knobs. There’s a Thread.kill, but its documentation warns us not to go there:
val kill : t -> unit
  [@@ocaml.deprecated "Not implemented, do not use"]
(** This function was supposed to terminate prematurely the thread
    whose handle is given. It is not currently implemented due to
    problems with cleanup handlers on many POSIX 1003.1c implementations.
    It always raises the [Invalid_argument] exception. *)
Lest this thread (ha) imply there are no solutions to this, there are higher-level abstractions provided. We could use Domainslib to create a Task pool, and there’s even work on an Lwt_domains (https://github.com/ocsigen/lwt/pull/860) that provides a nicer version of what I’m doing above.
My top-most example is just to help me get started offloading work from my main app thread right away with bare minimum re-engineering.
Though I haven’t tested this, I assume that this will work in Async apps as well:
let async_run_on_separate_core f =
  In_thread.run (fun () ->
    let d = Domain.spawn (fun _ -> f ()) in
    Domain.join d)
You’re rebuilding the Domainslib logic here, as it’s not efficient to constantly be spawning domains. Please see the Lwt_domains PR on the Lwt issue tracker for a version of this that uses Domainslib to maintain a pool of domains. Comments/tests welcome on that issue.
Domainslib actually doesn’t seem to have the logic I need.
I need to start a compute thread and run it entirely separately from the original application thread. The compute thread will run forever, and if it rejoins, that’s an error[1]. The current Domainslib interface doesn’t seem to offer an easy way to do this; your Task pool will include the parent domain if you use it naively.[2]
To make this work with Domainslib, it seems I would have to spawn a new domain, then inside of that create a pool (with ~num_additional_domains:0), and then set up some kind of communication channel for sending jobs to it.
Not using fork() here because I need to share some written state. Also fork on Windows blech.
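The spawn-a-domain-and-feed-it-jobs part can be sketched with Domainslib’s Chan module alone, without a Task pool. This is my own sketch, assuming the domainslib library is available; the jobs channel and the None shutdown sentinel are just conventions I picked:

```ocaml
(* A dedicated worker domain fed thunks over an unbounded channel.
   Sending None tells the worker to shut down. *)
let jobs : (unit -> unit) option Domainslib.Chan.t =
  Domainslib.Chan.make_unbounded ()

let worker () =
  let rec loop () =
    match Domainslib.Chan.recv jobs with
    | Some job -> job (); loop ()
    | None -> ()  (* shutdown sentinel *)
  in
  loop ()

let () =
  let d = Domain.spawn worker in
  Domainslib.Chan.send jobs (Some (fun () -> print_endline "hello from worker"));
  Domainslib.Chan.send jobs None;
  Domain.join d
```

For the run-forever case you would simply never send the sentinel (and never join), and have jobs report results back over a second channel.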
I think this would break the (de facto) invariant that Lwt/Async thunks don’t have to worry about re-entrance? Developers can generally write code and order their data updates without worrying about interleaving. Whereas running two async functions at the same time might go boom if they update shared state, which isn’t normally a concern on the current one-thunk-at-a-time scheduler.
Yes, it would. In practice, though, I tried it in the past and it actually did work with opium, which was kind of nice: 4 domains led to a 2x throughput improvement and, IIRC, a 10x improvement in latency.
distributing executions on multiple cores is parallelism
Lwt/Async are libraries for concurrency
These are separate features. However, there is something interesting to note: this thread was started specifically about multicore (i.e., parallelism), but OCaml 5.0 has two main features: parallelism and effects! And as far as Lwt/Async are concerned, effects are somewhat more important to consider.
In essence, Lwt/Async provides a user-space, very limited subset of the effects feature, a subset specialised for executing I/O and scheduling callbacks on those.
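To make that concrete, here is a toy example of mine (OCaml 5 stdlib only) showing the kind of suspend/resume that Lwt’s bind emulates in user space, written with first-class effects; the Await effect and the names are invented for illustration:

```ocaml
open Effect
open Effect.Deep

(* A single "await"-like effect; the handler decides when to resume. *)
type _ Effect.t += Await : string -> string Effect.t

let program () =
  (* Suspends here; the handler captures the continuation. *)
  let reply = perform (Await "ping") in
  "got " ^ reply

let run () =
  match_with program ()
    { retc = (fun v -> v);
      exnc = raise;
      effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Await msg ->
            Some (fun (k : (a, _) continuation) ->
              (* A real scheduler would park k until the I/O completes;
                 here we resume immediately with a canned reply. *)
              continue k ("pong to " ^ msg))
        | _ -> None) }
```

In Lwt the “continuation” is the callback chain you build with bind, and the “handler” is the event loop; with effects, both become ordinary language constructs the compiler knows about.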
There are two things we can do to make better use of the new OCaml features in Lwt/Async. (There are probably more, suggestions welcome.)
Extend the specialised subset of effects to support “detach this computation on another core”. This is essentially what @sudha’s work with Lwt_domain does.
This allows the users to make better use of their CPU.
Re-implement the internal scheduler of Lwt/Async using actual first-class effects and handlers.
This has the potential to be more efficient (because it uses first-class features of the language that the compiler “understands” better), and to allow users to use Lwt/Async jointly with other libraries that make use of effects.
The two points actually synergise, too: the second makes the first easier to implement.
Sorry to be mentioning it again but, as this thread demonstrates, we’re desperately missing some developer-oriented, general-purpose documentation on these new features, y’all.
I know it’s hard for all of us to stop the coding fun and write docs, but I really don’t want to have to go through other people’s code to figure out inner implementation details.