Is it safe to use a global hash table in an ocaml-5 parallel program?

Especially when following the constraint: different threads (“domains” in ocaml-5 parlance) never use the same keys?
I.e. each domain uses a set of keys which is not overlapping with any other domain’s set of keys.

I later realised (after asking this question) that the stdlib’s documentation clearly says no (the documentation was updated because of ocaml-5):
“Unsynchronized accesses to a hash table may lead to an invalid hash table state. Thus, concurrent accesses to a hash tables must be synchronized (for instance with a Mutex.t).”

1 Like

It is memory-safe, but it is not thread-safe, as there is no synchronization around hashtable accesses. For example, the resizing of an overgrown hashtable will happen concurrently to any other access, which means that elements might get lost, stored in the wrong buckets, etc.

So, operations modifying the hash table should use a semaphore?

Or, even read operations should use a semaphore?

What if the hash table is created big enough and will never need to be resized at runtime?

1 Like

The documentation says that unsynchronized access to a hashtable are a programming error. So you shouldn’t do that; either protect the table with a lock, or use a different implementation of a “concurrent” hashtable. (An atomic reference on a persistent map also works fine.)

Going in the details as you asked: in absence of resizing, a shared hashtable is safe. In presence of resizing, additions/removal may race with any other operation (including reads), so both mutations and reads need synchronization. See [multicore] does Hashtbl respect separation? · Issue #11681 · ocaml/ocaml · GitHub where I was asking basically the same question.

1 Like

It is not. Consider the following trace:

(* replace *)                       (* iter *)
| Cons ({key=k; next} as slot) ->
  if H.equal k key
  then (
    slot.key <- key;
                                    | Cons{key; data; next} ->
                                      f key data;
    slot.data <- data;
    false)

If H.equal is not a strict equality but some larger equivalence relation, then Hashtbl.iter will pass the wrong data associated to a key.

I was starting to think about that: an atomic reference to a persistent map.
It could give me what I need and better guarantees.

Notice that we’re talking about different threads using different keys i.e. sharding the hashtable across domains.

If each domain only uses its own keys, you’re likely better off just creating an array with a hashtable per domain.

1 Like

That does not matter. Consider removal for instance (which never resizes the hashtable). If you remove concurrently two different keys from the same bucket, you will end up with a bucket that still contains one of the keys if they were consecutive in the bucket.

The idea of “separation” might seem nice. But unfortunately, the safety condition is not that the threads never access equivalent keys, it is more like they never access the same buckets.

That’s a good point.

Either way, if the domains aren’t communicating anyway, just have a hashtable per domain

Hence, I wonder why the documentation of Domain.DLS.new_key is so complex and the interface so restrictive.

If you have specific criticisms or needs that you can justify, please go ahead. As such, your feedback is not actionable. Maybe you should explain what you need to do and why you have trouble doing it.

1 Like

The expected type of a hash table is: 'a key to 'b value. Not 'a key to 'a.

Domain.DLS is not a hash table. It lets you store values local to a domain and access them at a later point via unique keys bound to specific values.

1 Like

This is kind of my critique: it could/should be a hash table local/private to a domain, then it would be much more useful and the documentation could be simpler.

I don’t know that DLS is a good general-purpose solution. Also, the reason it’s complex is that the DLS needs to serve as a heterogenous map solution, which makes the typing situation complicated.

This does not make any sense to me; we use DLS keys to store one value per domain, not to store several values on a single domain. (Of course a DLS value could itself be a hashtable.) So your proposal sounds like it would not fill the need that DLS was designed for. Again, maybe you should explain what you need and why you need it. (And maybe it should be your own library and not the DLS module, if it is doing something different.)

One aspect of DLS that is known but not accurately described in the documentation yet is that the implementation is not designed to gracefully handle “lots of keys”. Datastructure implementations that would try to create a new DLS key per instance of the datastructure should use something different. The DLS module was designed for “singleton state” in modules/libraries, that should now have one state variant per domain.

This is interesting, how about in use-cases where we have a Module global hashtable which is loaded (key, value added) during module load time and thereafter only the find operations are performed. Is this still safe without the mutex and such? Or even in this use case do they need to be protected with mutex?
i.e.

module M = struct
  let h = Hashtbl.create 5
  let () = 
    Hashtbl.replace ht "a" "val a" ;
    Hashtbl.replace ht "b" "val b" ;

 let doa () : string = 
  match Hashtbl.find_opt ht "a" with
  | Some v -> v
  | None -> ""

  let dob () ....
  let doc () ...
end 

Here doa, dob and doc only does find operation on ht. Is this access/usage pattern safe without mutex, et al?

I think that, with the current implementation, this is actually safe. This being said, it could not be (for example a splay tree would not satisfy this expectation). Why is everyone trying to ignore what the documentation says and come as close as possible to making their program buggy?

For this use-case I would just use a 'a Map.Make(String).t Atomic.t. It may be slightly slower, but you sleep soundly at night, you are not relying on fragile internal details of a datastructure that explicitly documents again what you are doing.

2 Likes

Because creating a data structure by mutation and then publishing it to the rest of the program in read-only fashion is a very common idiom.

This is why documenting “thread-unsafe” is insufficient. You lack for instance the distinction between what mutates and does not mutate data.

3 Likes