Is it safe to use a global hash table in an ocaml-5 parallel program?

UnixJunkie · January 10, 2023, 2:07am

Especially when following the constraint: different threads (“domains” in ocaml-5 parlance) never use the same keys?
I.e. each domain uses a set of keys which is not overlapping with any other domain’s set of keys.

I later realised (after asking this question) that the stdlib’s documentation clearly says no (the documentation was updated because of ocaml-5):
“Unsynchronized accesses to a hash table may lead to an invalid hash table state. Thus, concurrent accesses to a hash tables must be synchronized (for instance with a Mutex.t).”

silene · January 10, 2023, 4:37am

It is memory-safe, but it is not thread-safe, as there is no synchronization around hashtable accesses. For example, the resizing of an overgrown hashtable will happen concurrently to any other access, which means that elements might get lost, stored in the wrong buckets, etc.

UnixJunkie · January 10, 2023, 4:55am

So, operations modifying the hash table should use a semaphore?

Or, even read operations should use a semaphore?

What if the hash table is created big enough and will never need to be resized at runtime?

gasche · January 10, 2023, 5:46am

The documentation says that unsynchronized access to a hashtable are a programming error. So you shouldn’t do that; either protect the table with a lock, or use a different implementation of a “concurrent” hashtable. (An atomic reference on a persistent map also works fine.)

Going in the details as you asked: in absence of resizing, a shared hashtable is safe. In presence of resizing, additions/removal may race with any other operation (including reads), so both mutations and reads need synchronization. See [multicore] does Hashtbl respect separation? · Issue #11681 · ocaml/ocaml · GitHub where I was asking basically the same question.

silene · January 10, 2023, 6:11am

It is not. Consider the following trace:

(* replace *)                       (* iter *)
| Cons ({key=k; next} as slot) ->
  if H.equal k key
  then (
    slot.key <- key;
                                    | Cons{key; data; next} ->
                                      f key data;
    slot.data <- data;
    false)

If H.equal is not a strict equality but some larger equivalence relation, then Hashtbl.iter will pass the wrong data associated to a key.

UnixJunkie · January 10, 2023, 6:44am

I was starting to think about that: an atomic reference to a persistent map.
It could give me what I need and better guarantees.

bluddy · January 10, 2023, 6:50am

Notice that we’re talking about different threads using different keys i.e. sharding the hashtable across domains.

bluddy · January 10, 2023, 7:10am

If each domain only uses its own keys, you’re likely better off just creating an array with a hashtable per domain.

silene · January 10, 2023, 7:34am

That does not matter. Consider removal for instance (which never resizes the hashtable). If you remove concurrently two different keys from the same bucket, you will end up with a bucket that still contains one of the keys if they were consecutive in the bucket.

The idea of “separation” might seem nice. But unfortunately, the safety condition is not that the threads never access equivalent keys, it is more like they never access the same buckets.

bluddy · January 10, 2023, 7:46am

That’s a good point.

Either way, if the domains aren’t communicating anyway, just have a hashtable per domain

UnixJunkie · January 10, 2023, 7:52am

Hence, I wonder why the documentation of Domain.DLS.new_key is so complex and the interface so restrictive.

gasche · January 10, 2023, 3:45pm

If you have specific criticisms or needs that you can justify, please go ahead. As such, your feedback is not actionable. Maybe you should explain what you need to do and why you have trouble doing it.

UnixJunkie · January 11, 2023, 1:14am

The expected type of a hash table is: 'a key to 'b value. Not 'a key to 'a.

sudha · January 11, 2023, 5:44am

Domain.DLS is not a hash table. It lets you store values local to a domain and access them at a later point via unique keys bound to specific values.

UnixJunkie · January 11, 2023, 8:42am

This is kind of my critique: it could/should be a hash table local/private to a domain, then it would be much more useful and the documentation could be simpler.

bluddy · January 11, 2023, 9:07am

I don’t know that DLS is a good general-purpose solution. Also, the reason it’s complex is that the DLS needs to serve as a heterogenous map solution, which makes the typing situation complicated.

gasche · January 11, 2023, 9:42am

This does not make any sense to me; we use DLS keys to store one value per domain, not to store several values on a single domain. (Of course a DLS value could itself be a hashtable.) So your proposal sounds like it would not fill the need that DLS was designed for. Again, maybe you should explain what you need and why you need it. (And maybe it should be your own library and not the DLS module, if it is doing something different.)

One aspect of DLS that is known but not accurately described in the documentation yet is that the implementation is not designed to gracefully handle “lots of keys”. Datastructure implementations that would try to create a new DLS key per instance of the datastructure should use something different. The DLS module was designed for “singleton state” in modules/libraries, that should now have one state variant per domain.

BikalGurung · January 11, 2023, 1:24pm

This is interesting, how about in use-cases where we have a Module global hashtable which is loaded (key, value added) during module load time and thereafter only the find operations are performed. Is this still safe without the mutex and such? Or even in this use case do they need to be protected with mutex?
i.e.

module M = struct
  let h = Hashtbl.create 5
  let () = 
    Hashtbl.replace ht "a" "val a" ;
    Hashtbl.replace ht "b" "val b" ;

 let doa () : string = 
  match Hashtbl.find_opt ht "a" with
  | Some v -> v
  | None -> ""

  let dob () ....
  let doc () ...
end

Here doa, dob and doc only does find operation on ht. Is this access/usage pattern safe without mutex, et al?

gasche · January 11, 2023, 1:56pm

I think that, with the current implementation, this is actually safe. This being said, it could not be (for example a splay tree would not satisfy this expectation). Why is everyone trying to ignore what the documentation says and come as close as possible to making their program buggy?

For this use-case I would just use a 'a Map.Make(String).t Atomic.t. It may be slightly slower, but you sleep soundly at night, you are not relying on fragile internal details of a datastructure that explicitly documents again what you are doing.

gadmm · January 11, 2023, 2:49pm

Because creating a data structure by mutation and then publishing it to the rest of the program in read-only fashion is a very common idiom.

This is why documenting “thread-unsafe” is insufficient. You lack for instance the distinction between what mutates and does not mutate data.

Topic		Replies	Views
Thread Safety in the Standard Library Ecosystem stdlib	4	921	May 6, 2024
Shared memory examples with OCaml multicore and Hashtbl Learning multicore , multiprocessing	4	1027	November 18, 2021
Using `[@poll error]` attribute to implement systhread safe data structures Learning	3	981	August 13, 2023
C bindings under OCaml 5: protecting data structures Community c , multicore-ocaml	2	672	March 4, 2023
Threads and atomicity of reading and assignement Learning	4	613	December 2, 2021

Is it safe to use a global hash table in an ocaml-5 parallel program?

Related topics