No Domain.maximum_domain_count() in the stdlib

Is the reason people are asking for an API to know the limit because the current limit is considered “too low”?
I don’t remember anyone asking for max number of Lwt threads.
In the Java community, where I spend most of my professional time, there is a limit on the number of threads (based on how much stack space you have), but the JVM makes no effort to compute a maximum, and I can’t remember anyone asking for one.
So will that question disappear if the runtime can make that maximum relatively high? (and what’s “high” in that context?)

From what I’ve seen in the parallel and concurrent programs I’ve had to deal with, the behaviour of a program degrades as it approaches whatever limit is imposed by the runtime and/or hardware, so trying to stay close to Domain.recommended_domain_count sounds like a good idea to me.

1 Like

Threads are not always running. Sometimes they are blocked, waiting passively for something to happen (like an increment of a counting semaphore).

Parany uses N parallel workers + 2. I.e., parany always overloads the machine by two threads if you have asked to use all the cores of your computer. Those two extra threads are maintenance threads; they are not supposed to be running all the time or doing heavy computation. The N other threads are expected to be heavy and almost always running, unless they are waiting for the jobs queue to fill or for the results queue to drain.

N is the number of cores requested by the user (so they will usually choose between 1, i.e. no parallelization, and up to getconf _NPROCESSORS_ONLN).
I am pretty sure that if you only use N−2 workers (i.e. limit yourself to Domain.recommended_domain_count threads in total), the parallelization performance will be lower. A sketch of the layout follows.
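To make the N + 2 layout concrete, here is a minimal sketch using Domain.spawn, with Domainslib.Chan as bounded job/result queues. This is my own simplification for illustration, not parany’s actual implementation:

```ocaml
(* "N workers + 2 maintenance threads": a feeder, N heavy workers,
   and a collector. Total domains created: nprocs + 2. *)
module C = Domainslib.Chan

let parmap nprocs (f : 'a -> 'b) (inputs : 'a list) : 'b list =
  let jobs = C.make_bounded (2 * nprocs) in
  let results = C.make_bounded (2 * nprocs) in
  let n = List.length inputs in
  (* maintenance domain 1: the feeder, mostly blocked on the bounded queue *)
  let feeder =
    Domain.spawn (fun () ->
      List.iter (fun x -> C.send jobs (Some x)) inputs;
      for _ = 1 to nprocs do C.send jobs None done)  (* poison pills *)
  in
  (* the N heavy workers, expected to be almost always running *)
  let workers =
    List.init nprocs (fun _ ->
      Domain.spawn (fun () ->
        let rec loop () =
          match C.recv jobs with
          | None -> ()
          | Some x -> C.send results (f x); loop ()
        in
        loop ()))
  in
  (* maintenance domain 2: the collector (results come back unordered) *)
  let collector =
    Domain.spawn (fun () -> List.init n (fun _ -> C.recv results))
  in
  Domain.join feeder;
  List.iter Domain.join workers;
  Domain.join collector
```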

What if you set it to use Domain.recommended_domain_count + 2 domains in total? How about 2 * Domain.recommended_domain_count? Which point in this spectrum would yield the best performance? My guess would be the former?

At the risk of attributing thoughts to someone else (and of being corrected if so):

I don’t think the message you are replying to is really about the specific number on this specific use-case. I think this is an illustration that occasionally you might want the number of threads to exceed the number of cores. And it’s also, by implication, an illustration of the issue of not having a way to query the maximum number of domains:

Whether the best performance is with Domain.recommended_domain_count + 2 or Domain.recommended_domain_count * 2, the issue would be that you cannot be certain that either of those numbers is safe.

Something like min (Domain.recommended_domain_count + 2) Domain.max_domain_count, on the other hand, would be safe.
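In OCaml that would look roughly like this (hypothetical, since no Domain.max_domain_count exists in today’s stdlib):

```ocaml
(* Hypothetical: Domain.max_domain_count is not in the stdlib today. *)
let n_domains =
  min (Domain.recommended_domain_count () + 2) (Domain.max_domain_count ())
```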

I’m not convinced that this example (which seems to be the actual motivating example, since the OP is the creator of parany) illustrates a problem. Here’s my thought process. Recommended domain count is basically the number of cores on the machine. On a machine with fewer cores (let’s say 4), it should be safe to spin up 4 + 2 = 6 domains because that’s a pretty small number. On a machine with a lot of cores, say 256, recommended domain count will be, for argument’s sake, 256. That’s a lot of domains. We can afford to reserve a couple of them for maintenance/background work and still squeeze performance out of the rest of the cores.

My argument is that the actual number of domains you want depends on both your workload and the machine it’s running on. For many (most?) workloads it will be the recommended domain count. For specialized workloads like parallel compute engines, it will need more specialized design. But in either case, I don’t see a motivation to have so many domains that we run the risk of hitting the ‘too many domains’ exception.

Mwell, I’m not sure parany has the right motivation for the requested max-domains API. Parany creates (N + 2) domains for each of its parallel operations: do a parmap inside a parmap and you’ve got 64 domains (+ 18 “dormant” ones) running on your N = 8 CPU cores… Hopefully it’s slow enough that you won’t have the patience to go deeper :smiley:

@talex5’s use case is much more sensible (“I won’t personally go over the recommended limit but a user might”), as having an upper bound allows a cheap DCAS by packing small integers into one word (for example “number of active readers on my data structure | number of pending writers”, or this code in eio).
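For the curious, the packing trick looks roughly like this. A minimal sketch assuming the domain count (and hence each counter) fits in 16 bits; the names and layout here are mine, not eio’s:

```ocaml
(* Readers in the low 16 bits, pending writers above them; a single CAS
   on the packed word updates (or checks) both fields atomically. *)
let state : int Atomic.t = Atomic.make 0

let readers s = s land 0xFFFF
let writers s = s lsr 16

(* Take a read "lock" only if no writer is pending: effectively a DCAS
   on two small counters, done with one compare_and_set. *)
let rec try_read_lock () =
  let s = Atomic.get state in
  if writers s > 0 then false
  else if Atomic.compare_and_set state s (s + 1) then true
  else try_read_lock ()  (* lost the race; retry *)
```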

Ignoring the OCaml runtime’s current limit, we could ask the kernel /proc/sys/kernel/threads-max for that upper bound (failing to allocate a new thread/domain could of course happen much earlier!) But then, would we really want to assert on boot if the system is configured with a huge maximum that will never actually be used? (On the other hand, it doesn’t seem like a good idea to pack a lot of byte-sized counters into a single int Atomic.t on the assumption that the OCaml upper bound is currently very low.)
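(For reference, reading that kernel knob from OCaml is a one-liner, minus error handling; note this is a kernel-wide bound on threads, not something the OCaml runtime promises to honour:)

```ocaml
(* Sketch: ask Linux for its global thread limit. *)
let kernel_threads_max () =
  let ic = open_in "/proc/sys/kernel/threads-max" in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () -> int_of_string (String.trim (input_line ic)))
```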

(“65,535 is more domains than anyone will ever need”, but hopefully software would be written significantly differently when that time comes? :stuck_out_tongue: )

A parmap inside a parmap is very likely a programming error. Usually only the external map should be parallelized; the inner one should be a sequential (classic) List.map.

If you use more than the recommended number of domains, the performance of the program will probably fall off a sharp cliff – in my tests, programs sometimes became much slower than when using just 1 domain. We haven’t documented that carefully enough and I would like to clarify it in the documentation. I repeat: OCaml domains do not behave like the native pthreads people may be familiar with; they are heavier abstractions.

There are niche use cases for a non-resizable array indexed by domain, but as pointed out above, we don’t know if we want to advertise the existence of a fixed limit, as it may vanish later. In the meantime, people who really want to write advanced, low-level code for this should access the runtime internals through C – and accept that they are relying on unstable aspects of the runtime, with no compatibility guarantees.

Another reason to shy away from publishing the maximum number of domains is that it might be prone to misuse if not documented very carefully – and again, currently our documentation is not good enough. This might hurt more people than it would help, and is a good reason to take things slowly.

@UnixJunkie: If you want to spawn many tasks in parallel, don’t try to spawn many domains. Use a fixed-size domain pool and send the work there. This is what domainslib does explicitly; it’s a first iteration, and I am sure that other designs will be proposed in this space (see e.g. idle-domains). Use a domain pool.
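A minimal sketch of that pattern with domainslib’s Task pools (API as of domainslib 0.5; check your version):

```ocaml
(* Set up one fixed-size pool and send all parallel work to it. *)
let () =
  (* the current domain also participates, so spawn one fewer *)
  let num_domains = Domain.recommended_domain_count () - 1 in
  let pool = Domainslib.Task.setup_pool ~num_domains () in
  let sum =
    Domainslib.Task.run pool (fun () ->
      Domainslib.Task.parallel_for_reduce pool ( + ) 0
        ~start:1 ~finish:1_000_000 ~body:(fun i -> i))
  in
  Printf.printf "sum = %d\n" sum;
  Domainslib.Task.teardown_pool pool
```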

6 Likes

I created an upstream issue to track that we need to improve the documentation: Domain: document clearly that creating more domains than recommended is a terrible idea · Issue #11921 · ocaml/ocaml · GitHub . I plan to do it eventually, but anyone should feel free to beat me to it and submit a PR improving the documentation.

And if you spawn domains yourself, people will blame you for that and say that is why you don’t get good performance for your program. :smile:

@UnixJunkie Actually, this is not such a bad way of seeing things. Sometimes, when writing parallel code, you need to be aware of CPU cores and treat them as a resource. See per-CPU data structures in modern concurrent memory allocators, or restartable sequences in Linux. A domain is an abstraction for this resource. It makes sense that most programmers would not want such power (and responsibility), and would prefer to go with a library abstraction like idle-domains. It is fitting that multicore OCaml did not come with just Domain, but from the get-go was offered together with such a library, domainslib.

1 Like

I wonder if it was considered at some point that parallel programmers would explicitly mark values which are to be shared between threads, along with the expected access mode (read-only, write-only, read-and-write). Only such values would require being handled by a parallel GC; other values would not require parallel GC treatment. When parallel programming, some people like to have more control and less abstraction. Personally, I would have loved to control what goes into shared memory, and what the allowed access modes to it are. Unmarked values should not be visible to other threads, and should be impossible to access from them.

If this is expected to be statically enforced, I suspect that you will need Rust-style ownership and borrowing. IIUC, making such a system work nicely with the rest of a program where objects may be GCed is still an open challenge.

Perhaps looking at the problem from the other direction is useful. Explicitly mark things which are not going to be shared between domains. The compiler-checked stack allocation for OCaml seems like a great starting point for this.

1 Like

This is a much bigger set, and manual annotations must always be kept to a minimum (programmers hate them, and programmers make errors). If the program is purely sequential, all values would have to be marked as “unshared”.
I suspect that some users (like Facebook) put all their shared values in a giant hash table, hence they know that everything they get from this hashtbl is shared. Same for users of your Domainslib.Chan or other data structures in this library: if you retrieve anything from those data structures, you know it is a shared value. If you put something in there, it is because you want it to be shared.

If we do want to go in the other direction, i.e., marking things as shared, and we want to use the type system to capture this, then we will need to do something about function types. Given that we have closures, a function’s type alone doesn’t fully describe what is shared; the closure’s environment will also have to be marked as shared. Cloud Haskell - HaskellWiki is a good starting point for what the type system could look like. The original paper is a nice read: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.953.7791&rep=rep1&type=pdf. I suspect that modes will help here as well.

I realise that we’re off on a tangent with this discussion from the original intention of the post. Perhaps this can be discussed in a different thread.

1 Like

Yes, see here: Multicore Update: April 2020, with a preprint paper - #28 by gadmm.