Especially when following the constraint: different threads (“domains” in ocaml-5 parlance) never use the same keys?
I.e. each domain uses a set of keys which is not overlapping with any other domain’s set of keys.
I later realised (after asking this question) that the stdlib’s documentation clearly says no (the documentation was updated because of ocaml-5):
“Unsynchronized accesses to a hash table may lead to an invalid hash table state. Thus, concurrent accesses to a hash tables must be synchronized (for instance with a Mutex.t).”
It is memory-safe, but it is not thread-safe, as there is no synchronization around hashtable accesses. For example, the resizing of an overgrown hashtable will happen concurrently to any other access, which means that elements might get lost, stored in the wrong buckets, etc.
The documentation says that unsynchronized access to a hashtable are a programming error. So you shouldn’t do that; either protect the table with a lock, or use a different implementation of a “concurrent” hashtable. (An atomic reference on a persistent map also works fine.)
Going in the details as you asked: in absence of resizing, a shared hashtable is safe. In presence of resizing, additions/removal may race with any other operation (including reads), so both mutations and reads need synchronization. See [multicore] does Hashtbl respect separation? · Issue #11681 · ocaml/ocaml · GitHub where I was asking basically the same question.
That does not matter. Consider removal for instance (which never resizes the hashtable). If you remove concurrently two different keys from the same bucket, you will end up with a bucket that still contains one of the keys if they were consecutive in the bucket.
The idea of “separation” might seem nice. But unfortunately, the safety condition is not that the threads never access equivalent keys, it is more like they never access the same buckets.
If you have specific criticisms or needs that you can justify, please go ahead. As such, your feedback is not actionable. Maybe you should explain what you need to do and why you have trouble doing it.
This is kind of my critique: it could/should be a hash table local/private to a domain, then it would be much more useful and the documentation could be simpler.
I don’t know that DLS is a good general-purpose solution. Also, the reason it’s complex is that the DLS needs to serve as a heterogenous map solution, which makes the typing situation complicated.
This does not make any sense to me; we use DLS keys to store one value per domain, not to store several values on a single domain. (Of course a DLS value could itself be a hashtable.) So your proposal sounds like it would not fill the need that DLS was designed for. Again, maybe you should explain what you need and why you need it. (And maybe it should be your own library and not the DLS module, if it is doing something different.)
One aspect of DLS that is known but not accurately described in the documentation yet is that the implementation is not designed to gracefully handle “lots of keys”. Datastructure implementations that would try to create a new DLS key per instance of the datastructure should use something different. The DLS module was designed for “singleton state” in modules/libraries, that should now have one state variant per domain.
This is interesting, how about in use-cases where we have a Module global hashtable which is loaded (key, value added) during module load time and thereafter only the find operations are performed. Is this still safe without the mutex and such? Or even in this use case do they need to be protected with mutex?
i.e.
module M = struct
let h = Hashtbl.create 5
let () =
Hashtbl.replace ht "a" "val a" ;
Hashtbl.replace ht "b" "val b" ;
let doa () : string =
match Hashtbl.find_opt ht "a" with
| Some v -> v
| None -> ""
let dob () ....
let doc () ...
end
Here doa, dob and doc only does find operation on ht. Is this access/usage pattern safe without mutex, et al?
I think that, with the current implementation, this is actually safe. This being said, it could not be (for example a splay tree would not satisfy this expectation). Why is everyone trying to ignore what the documentation says and come as close as possible to making their program buggy?
For this use-case I would just use a 'a Map.Make(String).t Atomic.t. It may be slightly slower, but you sleep soundly at night, you are not relying on fragile internal details of a datastructure that explicitly documents again what you are doing.