No Domain.maximum_domain_count() in the stdlib

There is only Domain.recommended_domain_count.
If I want to know the maximum number of domains that can be created,
there is no programmatic way?

Domain.recommended_domain_count returns the maximum number of domains you can profitably use on your machine. When in doubt, it’s best to stick to this number. Another detail to be careful about: the runtime is not fine-tuned to handle more domains than there are hardware threads on the machine. If your program ends up doing this, you may actually observe slowdowns.
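
To make that concrete, here is a minimal sketch of the usual pattern: size a fixed set of worker domains from Domain.recommended_domain_count and join them (the work function below is just a placeholder).

(* A fixed pool of worker domains sized from recommended_domain_count.
   [work] is a placeholder for real per-domain work. *)
let work i = i * i

let () =
  (* the main domain counts too, so spawn one fewer worker *)
  let n = Domain.recommended_domain_count () - 1 in
  let workers = List.init n (fun i -> Domain.spawn (fun () -> work i)) in
  let results = List.map Domain.join workers in
  Printf.printf "used %d worker domains, sum = %d\n" n (List.fold_left (+) 0 results)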

There is indeed a limit on the maximum number of domains you can create at a time, and currently it is 128. This is an implementation detail in the runtime and may change in future. This is not something we should rely on while writing programs with Domains, as it is not set in stone.

Edit: as @octachron pointed out, the limit is on the maximum number of domains you can create simultaneously.

2 Likes

It would be better to be able to know this maximum value programmatically, rather than having to start a program with an increasing number of domains until it crashes.

Do you actually need to create more than 128 domains? I can think of two reasons why you might be doing it:

(1) You are running your programs on a machine with more than 128 cores.
(2) The granularity of tasks for which a new domain is created is small.

If it’s (1), it could be useful to create an issue for this on ocaml/ocaml if there isn’t one already. For (2) you might want to re-evaluate the granularity of work done by a new domain.

I remember it is mentioned somewhere in the docs that domains are not really meant to be created dynamically, rather once, with a perhaps configurable number, at the beginning of the program is the usual.

1 Like

I also would like to know the maximum. When writing lock-free algorithms I sometimes need to pack a counter of running operations into part of an int, and I need to know how many bits to reserve for this. I’d like to write code such as:

let () =
  assert (Domain.max_domains < 0x1000)

1 Like

Note that this is not a good reason for creating more domains than Domain.recommended_domain_count: the performance of the stop-the-world minor GC will drop significantly when there are many dormant domains to wait for.

@UnixJunkie, for all intents and purposes, you should treat Domain.recommended_domain_count as the maximum number of domains that should run in parallel in a normal program.

And there is no maximum number of domains that can be created (there is a maximal number of domains that can be alive simultaneously however).

2 Likes

An exception is raised when you try to create too many domains.
I don’t remember exactly which one, but I have already seen it.
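
For the curious, here is a quick way to see it. This is only a sketch: it assumes Domain.spawn raises when the limit is hit, and both the limit and the exact exception are implementation details that may change.

(* Spawn idle domains until spawn fails, then report which spawn failed
   and with what exception. The spawned domains just wait until the main
   domain tells them to stop. *)
let () =
  let stop = Atomic.make false in
  let rec loop acc n =
    match Domain.spawn (fun () ->
        while not (Atomic.get stop) do Domain.cpu_relax () done)
    with
    | d -> loop (d :: acc) (n + 1)
    | exception e ->
      Printf.printf "spawning domain number %d failed: %s\n%!" n (Printexc.to_string e);
      acc
  in
  let domains = loop [] 2 in  (* the main domain already counts as one *)
  Atomic.set stop true;
  List.iter Domain.join domains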

It is very likely that people will use more than Domain.recommended_domain_count.
Sure, if I have N hardware threads, I want N workers.
But there will also be additional domains launched which have less work to do and hence are less loaded (sometimes those “maintenance” domains will be dormant).

As far as I understand, it’s a programming error. You should not launch more domains in parallel than your hardware can support, and Domain.recommended_domain_count () is supposed to give you this number. But if you want to launch more tasks, then they should not be parallel but concurrent, using fibers in each of the domains you’ve spawned.

It seems very unlikely to me. The messaging from the beginning has been to use only a fixed static number of domains equal to the number of cores, and to multiplex many green threads on top of those domains using something like domainslib’s tasks or eio’s fibers. This concept is very similar to e.g. Go’s goroutines or Java’s virtual threads, so most people are likely to be familiar with it now and especially in the future.
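
As a sketch of that pattern with domainslib (a third-party library; the Task functions below are its API as I remember it, and the task bodies are just placeholders): a fixed pool of domains runs many lightweight tasks.

module T = Domainslib.Task

let () =
  (* one fewer than recommended: the pool also uses the calling domain *)
  let num_domains = max 0 (Domain.recommended_domain_count () - 1) in
  let pool = T.setup_pool ~num_domains () in
  let sum =
    T.run pool (fun () ->
        (* many small tasks multiplexed over the fixed pool of domains *)
        let promises = List.init 1_000 (fun i -> T.async pool (fun () -> i * i)) in
        List.fold_left (fun acc p -> acc + T.await pool p) 0 promises)
  in
  Printf.printf "sum of squares = %d\n" sum;
  T.teardown_pool pool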

There is going to be a binary that reads some environment variable to decide on the number of domains. (Something similar to make’s or dune’s -j argument.)

When writing this binary, it’d be useful to do a sanity check on the value before spawning the domains. Something that prints a warning if the number is bigger than recommended_domain_count and prints an error (and exits with a non-zero status) if the number is bigger than the maximum domain count. That’s the UX I would expect.
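
Something along these lines, as a sketch. The NJOBS variable name is made up, and since the stdlib exposes no maximum the 128 limit is hard-coded here, which is exactly the problem being discussed.

(* Sanity-check the requested number of domains before spawning them. *)
let max_domain_count = 128 (* current runtime limit, an implementation detail *)

let requested_domains () =
  match Sys.getenv_opt "NJOBS" with
  | None -> Domain.recommended_domain_count ()
  | Some s ->
    (match int_of_string_opt s with
     | Some n when n > 0 -> n
     | _ -> prerr_endline "NJOBS must be a positive integer"; exit 1)

let () =
  let n = requested_domains () in
  if n > max_domain_count then begin
    Printf.eprintf "error: %d domains exceeds the runtime limit (%d)\n" n max_domain_count;
    exit 1
  end;
  if n > Domain.recommended_domain_count () then
    Printf.eprintf "warning: %d domains is more than recommended_domain_count (%d)\n"
      n (Domain.recommended_domain_count ());
  (* ... spawn the n worker domains here ... *)
  ignore n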

So I’d agree with @UnixJunkie that it is likely and also that programmatically querying the maximum number of domains is useful. It’s like int_size: a constant of the runtime that most programs don’t need but that some advanced programs may need to know about.

2 Likes

This is already a sign of a programming error somewhere. It may make sense to fail and exit at this stage (or even when going a few domains above this number).

The maximal number of domains is an implementation detail of the runtime. Moreover, the runtime makes no promise to scale in a reasonable way up to this point. No programs should depend on this number, no matter how advanced they think they are.

In other words, the maximum number of domains is specified as greater than or equal to the recommended number of domains.

Integer size is an implementation detail of the runtime (as opposed to Sys.word_size which is a detail of the processor). Making Sys.int_size available is still useful for some corner cases.

Sys.max_string_length and the other Sys.max_*_length constants too. They are implementation details of the runtime that can trigger exceptions if exceeded, just like the maximum number of domains. And, as with domains, it probably makes sense to fail early and switch to some differently-allocated structure (bigarrays, bigstrings, or something like that).
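
As a sketch of that fallback (the buffer type and alloc function here are made up for illustration):

(* Pick a representation based on Sys.max_string_length: small buffers
   use Bytes, larger ones fall back to a bigarray. *)
type buffer =
  | Small of Bytes.t
  | Big of (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

let alloc n =
  if n <= Sys.max_string_length then Small (Bytes.create n)
  else Big (Bigarray.Array1.create Bigarray.char Bigarray.c_layout n)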

1 Like

I wish you good luck in mixing parallelism with concurrency and still being able to reason about your programs.

To be honest, I don’t care. I never use mutable values, and so from a reasoning point of view parallelism and concurrency are similar. By the way, parallelism is intended to improve your code’s performance (if used correctly). But it’s clear from the OCaml documentation that if you launch more domains than Domain.recommended_domain_count () then you will observe a slowdown instead of an improvement.

1 Like

The integer size is part of the specification of the language. Even more so since OCaml arithmetic is defined as modular arithmetic on ℤ/(2^Sys.int_size)ℤ.
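
For instance, integer overflow wraps around modulo 2^Sys.int_size:

(* max_int + 1 wraps around to min_int, i.e. arithmetic modulo 2^Sys.int_size. *)
let () =
  assert (max_int + 1 = min_int);
  Printf.printf "Sys.int_size = %d\n" Sys.int_size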

Those are both specified by the language too. The constraint on array sizes is an important limitation of the 32-bit implementation of OCaml, and those constants are defined in the language to make it possible to handle large data.

In the case of the maximum number of domains, first the API is still in flux: it is not certain that the hard limit will stay (it might become possible to set this number either with a runtime parameter or programmatically in later versions of OCaml, or the maximum number of domains could be increased dynamically). Second, if a reasonable program observes that the maximum number is not +∞, this is probably a bug in the runtime itself rather than the program.

1 Like

What specifications?

There are three different values that the integer size can be (that I know of, for now). It’s not a fixed element of the language: it’s a runtime implementation detail. This runtime implementation detail has an impact on the semantics of some of the functions exported in the stdlib (e.g., (+)). As such it is useful to be able to programmatically query it.

Similarly, the function spawning domains behaves differently depending on the maximum number of domains (it raises an exception or it doesn’t). As such it is useful to be able to programmatically query it.

Now that’s an argument I can get behind. If the limit is likely to disappear in a couple of releases then indeed I can understand it not being added to the Stdlib.

Is it really? I have yet to read an explanation of why a program would need to create more than the recommended number of domains, especially since you are guaranteed that creating a domain will never trigger an exception if you don’t (modulo resource exhaustion).

One thing to keep in mind is that recommended_domain_count tries hard to convey the intent of the user. Consider the following trivial program:

let () = Printf.printf "max = %d\n%!" (Domain.recommended_domain_count ())

If the user forbids this program to use more than 3 cores, then recommended_domain_count actually says so:

$ taskset 25 ./a.out
max = 3

Why would you want to go past that limit? The operating system will force any extra domain to fight over a scheduling slice with the currently running domains anyway.

If your program is IO-bound, adding more domains will not make it faster. If your program is memory- or compute-bound, adding more domains will certainly make it slower.

1 Like

You write a program that automatically spins up recommended_domain_count domains. Everything works well on your 8-core machine when suddenly you get a bug report. You can’t reproduce it locally. And after some exchange you realise that they have a 16-core machine. You need to spin up more domains than your local limit just so that you can reproduce the bug.

You fix the bug. You decide to avoid such bugs in the future. You want the CI to run your test suite with different numbers of domains. You make a job for 1 domain, another for max_domain / 4, another for max_domain / 2, and another for max_domain. Sure, those many-domained jobs will probably run slower than the single-domain job (what with the single-core worker you get for free from GitHub). But the alternatives don’t look too nice.

And that’s an answer to the question “Why would you want to go past that limit?”. I’m actually not advocating to go past that limit (at least not in the general case). But wanting to know the limit is different from wanting to go past the limit. For example, as @talex5 mentioned:

When you write a library, it’s important to make sure it can handle whatever the user throws at it, or fail gracefully with a relevant error message if it can’t.

2 Likes

It’s just a data point, not a statistic. But in my experience, the likelihood of reproducing the bug with half the cores and a bunch of domains fighting to be scheduled is close to none.
It’s like saying you’ll simulate having more RAM by increasing the size of your swap partition: you may find some issues, but they’ll be very different from the original bug report.

2 Likes