OCaml 5; domains, Erlang, Beam, OTP

In my very limited understanding – Erlang newb here – what makes Erlang magical is having lightweight processes that

  1. have per process heap (ultra low latency GC)
  2. can crash independently (helps w/ per process heap)
  3. can’t CPU hog (no iteration, only recursion; every function call checks how many total reductions have been done and forces a yield after a limit)
  4. ability to monitor (receive notification when another process dies)

I believe other Erlang / Beam / OTP features are consequences of the above.

Question: With OCaml5 / domains, do we get anything close to “lightweight process” with above features ?

Somewhat of a followup to “With domains, is anyone actively creating or updating their fault tolerant, let it crash like framework in OCaml?”

Yes, the OCaml equivalent of that is Eio’s Fibers. They basically correspond to Erlang processes in the sense that they are both lightweight threads that run multiplexed on top of OCaml domains. Check out my post for a use case of fibers in a simple service: Practical OCaml, Multicore Edition - DEV Community

Eio has a supervision context similar to Erlang supervisor actors, called switches. Check out the documentation of the Eio.Switch module.

I’ll also be releasing as part of a commercial SDK an Actor system built on top of Redis (for its distributed ecosystem), LWT and Capnproto. It will be OCaml 4.x for a variety of reasons. I will briefly talk about it at this month’s Houston Functional Programming User’s Group (https://hfpug.org/event/jonah-beckford-what-distributing-ocaml-on-windows-gave-me/) although that is not the main topic.

Sounds like a very interesting talk, do you know if it will be recorded?

I C-f’ed that post for “heap”; it does not appear. Can you please verify that Eio fibers have their own heap (as Erlang processes do)?

My understanding is that each domain can only run one fiber at a time; this makes me highly suspicious of each fiber having its own heap.

Also, if fibers can run until they hit IO, it means malicious / poorly written fibers can CPU hog with infinite loops; this is in contrast to Erlang processes, which are forcibly evicted after a certain # of ‘reductions’.

Erlang’s magic comes from being limited in what you share between threads. This simplifies the design of an efficient GC (each thread can be GCed separately), and it simplifies the work to get back to a consistent shared state after an error.

OCaml 5 chose to have few restrictions on what you can share between threads. Having a GC simplifies the implementation of parallel data structures. Moreover, OCaml 5 makes no distinction between local and shared state: the GC assumes that any value may end up being shared between domains. Design decisions that follow from this choice, such as having a stop-the-world minor collector, work against goal 1.
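
To make the sharing point concrete, here is a minimal sketch using only the OCaml 5 standard library (`Domain`, `Atomic`). The names `shared_counter`, `shared_list` and `bump` are made up for illustration. Two domains read a list built by the main domain and update one atomic counter, with no copying and no message passing, which is what an Erlang process cannot do with another process's heap:

```ocaml
(* Values built by one domain are directly reachable from any other domain. *)
let shared_counter = Atomic.make 0

(* A heap-allocated list built by the main domain: [1; 2; 3]. *)
let shared_list = List.init 3 (fun i -> i + 1)

(* Each worker bumps the shared counter [n] times, then sums the shared
   list in place (no copy is made when the closure crosses domains). *)
let bump n () =
  for _ = 1 to n do
    Atomic.incr shared_counter
  done;
  List.fold_left ( + ) 0 shared_list

let () =
  let d1 = Domain.spawn (bump 1_000) in
  let d2 = Domain.spawn (bump 1_000) in
  let s1 = Domain.join d1 in
  let s2 = Domain.join d2 in
  assert (s1 = 6 && s2 = 6);
  assert (Atomic.get shared_counter = 2_000)
```

Because any value can be reached from any domain like this, the runtime cannot collect one domain's data in complete isolation the way BEAM collects one process.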

As for 2., in the presence of shared state one must be careful about how one cleans up the mess after an error occurred. This asks for good resource-management features and a good story about exception-safety (see the example of Rust, which we have already discussed). Unfortunately, in OCaml you have to count on programmers’ discipline for this, which they have to discover the hard way because there is no consistent story being told about it. My understanding is that OCaml 5 was designed around a more defensive view of errors than Erlang’s let-it-crash. I do not see it (edit: the defensive approach) as realistic, and my past work on recovering from asynchronous exceptions at the OCaml workshop 2021 is relevant as it was also meant as a way to reverse this view (with the intuition that if one shows that it is possible to recover from asynchronous exceptions, then it also tells you to which extent you can recover from exceptions in general). On this topic, the following discussion about Eio is relevant: Understanding cancellation (in eio).
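
A small sketch of what “crashing independently” can look like with plain domains, assuming only the standard library. `in_flight`, `Worker_failed`, `worker` and `supervise` are hypothetical names, not part of any framework. The exception raised in the child domain is re-raised by `Domain.join` in the parent, which is the closest stdlib analogue to being notified of a death; the `Fun.protect` is the programmer-discipline part, since nothing else would restore the shared counter after the crash:

```ocaml
(* Shared state with an invariant: [in_flight] is 0 when no worker runs. *)
let in_flight = Atomic.make 0

exception Worker_failed of string

(* The worker touches shared state, so it must undo that itself,
   even when it dies with an exception. *)
let worker ~fail () =
  Atomic.incr in_flight;
  Fun.protect
    ~finally:(fun () -> Atomic.decr in_flight)
    (fun () ->
      if fail then raise (Worker_failed "boom");
      42)

(* The parent observes the child's fate through [Domain.join], which
   re-raises the child's uncaught exception in the joining domain. *)
let supervise ~fail =
  let d = Domain.spawn (worker ~fail) in
  match Domain.join d with
  | v -> Ok v
  | exception Worker_failed msg -> Error msg

let () =
  assert (supervise ~fail:false = Ok 42);
  assert (supervise ~fail:true = Error "boom");
  (* The invariant survived the crash only because of [Fun.protect]. *)
  assert (Atomic.get in_flight = 0)
```

Drop the `Fun.protect` and the failing run leaves `in_flight` at 1 forever; in Erlang the dead process's heap would simply be discarded.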

Sorry, are you saying:
(1) OCaml5’s defensive view is unrealistic or
(2) Erlang’s let it crash is unrealistic

I’m confused because in the next sentence, I’m not sure if you are saying: “I made progress on this problem” or “I have a reduction showing that if we could solve this problem, we can solve this other hard problem”

You are right, each Eio fiber does not get its own heap. In OCaml 5, each domain has a minor heap and they all share a major heap. Each fiber runs exclusively for a slice of time on a domain, during this time it has control of the heap. And yes, this means we have to be careful to move blocking operations out of the domains which are dedicated to running fibers. Sophisticated async I/O runtimes often maintain a dedicated worker pool for this, e.g. check out ZIO in Scala.

By contrast Eio is still at an early stage and needs adoption and feedback to grow a set of features oriented towards safety and ergonomics.

Sorry, I fixed it. I do not take the defensive view as realistic, all the more so as much well-established OCaml software already makes crucial use of asynchronous exceptions.

Sorry to be pedantic. Are fibers pre-emptive or co-operative ? From what I’ve read online, fibers run until they call an effect, which I’m interpreting to mean they can infinite loop. However, I’m a newb with regards to this topic; this is just from what I’ve read.

Yes, it will be recorded.

I think to get pre-emption you can use N threads on M domains, and the OCaml runtime will periodically switch between the threads running on a single domain: there is a dedicated ticker thread running every 50? ms that forces an interrupt on the domain.
It may not be bulletproof. I think in OCaml 4 the switching could only happen at points where the GC might run, but OCaml 5 extends that with poll points on loop back-edges, etc. https://github.com/ocaml/ocaml/blob/ebc23f188f3cff679e6547d8d569ad5c0ef3de92/asmcomp/polling.ml#L91

IIUC fibers are fully cooperative, just like Lwt is: a lightweight thread that never calls ‘bind’ or a syscall wrapper can hog the CPU and no other fiber/Lwt lightweight thread would run: it is the programmer’s responsibility to insert calls to ‘yield’ in long running / computation heavy loops.
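
Here is a toy illustration of what “fully cooperative” means, written directly against OCaml 5's `Effect` module. This is a sketch of the general technique, not how Eio's scheduler is implemented; `Yield`, `run`, `polite` and `hog` are names invented for the example. A fiber only gives up the domain when it performs `Yield`, so the `hog` fiber runs its whole loop before anyone else gets a turn:

```ocaml
open Effect
open Effect.Deep

type _ Effect.t += Yield : unit Effect.t

let yield () = perform Yield

(* Run [fibers] round-robin on the current domain. A fiber is only
   descheduled when it performs [Yield]; nothing ever preempts it. *)
let run fibers =
  let ready : (unit -> unit) Queue.t = Queue.create () in
  let run_next () =
    match Queue.take_opt ready with
    | Some k -> k ()
    | None -> ()
  in
  let spawn f =
    match_with f ()
      { retc = (fun () -> run_next ());
        exnc = raise;
        effc = (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Yield ->
            Some (fun (k : (a, unit) continuation) ->
              (* Park the current fiber at the back of the queue. *)
              Queue.push (fun () -> continue k ()) ready;
              run_next ())
          | _ -> None) }
  in
  List.iter (fun f -> Queue.push (fun () -> spawn f) ready) fibers;
  run_next ()

(* Record the order in which fibers actually ran. *)
let log = ref []
let note s = log := s :: !log

let polite name () =
  for i = 1 to 2 do
    note (Printf.sprintf "%s%d" name i);
    yield ()
  done

(* Never yields: no other fiber runs until this loop is finished. *)
let hog () =
  for i = 1 to 3 do
    note (Printf.sprintf "hog%d" i)
  done

let () =
  run [ polite "a"; hog; polite "b" ];
  assert (List.rev !log = [ "a1"; "hog1"; "hog2"; "hog3"; "b1"; "a2"; "b2" ])
```

Replace the `for` loop in `hog` with `while true do () done` and fibers `a` and `b` never run again, which is the situation Erlang's reduction counting rules out.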

Very interested in what form the Actor system takes, especially compared to Akka/Scala and Erlang/OTP. Can you give a rough outline of what the architecture is and how you are using CapnP?

Please share the slides and talk afterwards.

If you are dreaming about distributed programs that, of course, will also scale well on a parallel computer, forget about OCaml 5 and start looking at the zmq library.

Sounds very interesting, looking forward to reading/hearing more about it!

I’m taking a complete stab in the dark here.

Is there an intuition here that

  1. most languages are “code first, data second” ; example: serializing / deserializing / persisting data are not built in language constructs

  2. for many tasks, we should be “data first, code second”; SQL gets this right, but is limited to SQL-ish things

  3. are you building something that is “data first, code second”, where data == redis, and code == OCaml threads ?

There is a lot to unpack here, and you have clearly thought about this much more deeply than I have.

  1. I agree with you that the Erlang let-it-crash mode definitely works. I like this idea of “micro restarts”

  2. I am agnostic on whether ‘defensive’ works (it seems to work pretty well in Rust)

  3. I do not understand how this relates to async exceptions. Can you explain the link ?

  4. That thread is 57 posts deep and a bit over my head; can you steelman the main argument from there ? I am particularly interested in the parts relating to why you believe the OCaml defensive approach is less realistic.

Thanks!

Having said all that earlier, I do want to add that I think (based on my experience with Scala), that the bare-bones actor model is a little too low-level for many or even most concurrent/distributed use cases. In the Erlang world it’s really GenServer that encapsulates the core behaviours that we want to build on top of (the synchronous client-server model). If you try building a system using purely async message-passing, it quickly gets really cumbersome.

In the Scala world it’s resource-safe IO and stream values, which I personally think would carry over pretty well to OCaml too. E.g. look at the last example in this section: Handling Resources | ZIO

val bytesInFile: IO[Throwable, Int] =
  ZIO.scoped {
    for {
      stream <- ZIO.fromAutoCloseable(openFileInputStream("data.json"))
      data   <- ZIO.attemptBlockingIO(stream.readAllBytes())
    } yield data.length
  }

In OCaml syntax this would look like:

open ZIO

let bytes_in_file filename = ZIO.scoped @@ fun () ->
  let* stream = ZIO.from_auto_closeable @@ open_file_input_stream filename in
  let+ data = ZIO.attempt_blocking_io @@ InputStream.read_all_bytes stream in
  Bytes.length data
(* val bytes_in_file : string -> (exn, int) io *)

Does that ZIO.scoped @@ fun () -> ... look familiar? It should, it’s basically what Eio’s Switch.run @@ fun sw -> ... is doing.
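
For anyone who wants to see the bare shape of that pattern without either library, here is a rough stdlib-only sketch. The `Scope` module is hypothetical: it is neither Eio's `Switch` nor ZIO's `Scope`, and it leaves out everything to do with fibers and cancellation. It only shows the idea that resources register a release hook with the enclosing scope, and the hooks run when the scope exits, whether normally or by exception:

```ocaml
(* Hypothetical module for illustration only. *)
module Scope : sig
  type t
  val run : (t -> 'a) -> 'a
  val on_release : t -> (unit -> unit) -> unit
end = struct
  type t = (unit -> unit) Stack.t

  let on_release sc hook = Stack.push hook sc

  let run body =
    let sc = Stack.create () in
    let release_all () =
      (* Most recently registered hook runs first. *)
      while not (Stack.is_empty sc) do
        (Stack.pop sc) ()
      done
    in
    Fun.protect ~finally:release_all (fun () -> body sc)
end

(* Counterpart of the bytes-in-file example: the channel is closed
   when the scope ends, even if reading raises. *)
let bytes_in_file filename =
  Scope.run @@ fun sc ->
  let ic = In_channel.open_bin filename in
  Scope.on_release sc (fun () -> In_channel.close ic);
  String.length (In_channel.input_all ic)
```

The caller never closes the channel explicitly; leaving the `Scope.run` block does it.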

@lambda_foo: The high level component architecture is ordinary Redis, with all OCaml code compiled into a single Redis module. Very similar to Redis functions | Redis, just with statically compiled OCaml rather than interpreted Lua.

Redis Cluster or Redis Sentinel can be used to increase availability and/or partitioning.

The actor mailbox data flow is Capnp messages over the Redis Streams data structure, although the first releases will be ordinary Redis queues that can lose messages after a restart. New messages (actor.send) are created either internally (from an OCaml actor Lwt thread resident within the OCaml-compiled Redis module) or you can send them using ordinary Redis clients. In particular, Redis modules can block the client without blocking the server which will help with the synchronous problem that @yawaramin referred to.

Compared to A/E (Akka / Erlang)? A/E has way way more fine-grained fault tolerance; Redis fault tolerance is at the node (shard) level. A/E scales vertically on multicore; Redis would rather have many 2-core or 4-core machines. My “DkSDK” is of course statically typed, unlike Erlang. And unlike A/E, DkSDK will eventually allow actors to run on the frontend (this will require Capnp RPC, not just Capnp, and also a working Javascript Capnp RPC), so it becomes a deployment decision where actors live. I also think that DkSDK will be easier to understand than A/E because DkSDK is just a light layer on top of the already familiar Redis.

Brain dumping all of that because it is very unlikely most of that will be part of the talk.

@zeroexcuses: Yes, data first rather than code first. I probably would describe it as keeping the code close to the data. But even the use of Capnp for messages requires that the data schema is written before the OCaml code is generated, so data first is a good description.