A hack to implement efficient TLS (thread-local-storage)

Currently OCaml 5 provides a Domain.DLS module for domain-local storage.

Unfortunately,

  1. there is no corresponding Thread.TLS for (sys)thread-local storage, and

  2. the current implementation of Domain.DLS is not thread-safe.

I don’t want to spend time to motivate this topic, but for many of the use cases of Domain.DLS, what you actually want, is to use a Thread.TLS. IOW, many of the uses of Domain.DLS are probably “wrong” and should actually use a Thread.TLS, because, when using Domain.DLS, the implicit assumption is often that you don’t have multiple threads on the domain, but that is typically decided at a higher level in the application and so making such an assumption is typically not safe.

Domain.DLS is not thread-safe

I mentioned that the current implementation of Domain.DLS is not thread-safe. What I mean by that is that the current implementation is literally not thread-safe at all in the sense that unrelated concurrent Domain.DLS accesses can actually break the DLS. That is because the state updates performed by Domain.DLS contain safe-points during which the OCaml runtime may switch between (sys)threads.

Consider the implementation of Domain.DLS.get:

  let get (idx, init) =
    let st = maybe_grow idx in
    let v = st.(idx) in
    if v == unique_value then
      let v' = Obj.repr (init ()) in
      st.(idx) <- (Sys.opaque_identity v');
      Obj.magic v'
    else Obj.magic v

If there are two (or more) threads on a single domain that concurrently call get before init has been called initially, then what might happen is that init gets called twice (or more) and the threads get different values which could e.g. be pointers to two different mutable objects.

Consider the implementation of maybe_grow:

  let maybe_grow idx =
    let st = get_dls_state () in
    let sz = Array.length st in
    if idx < sz then st
    else begin
      let rec compute_new_size s =
        if idx < s then s else compute_new_size (2 * s)
      in
      let new_sz = compute_new_size sz in
      let new_st = Array.make new_sz unique_value in
      Array.blit st 0 new_st 0 sz;
      set_dls_state new_st;
      new_st
    end

Imagine calling get (which calls maybe_grow) with two different keys from two different threads concurrently. The end result might be that two different arrays are allocated and only one of them “wins”. What this means, for example, is that effects of set calls may effectively be undone by concurrent calls of get.

In other words, the Domain.DLS, as it is currently implemented, is not thread-safe.

How to implement an efficient Thread.TLS?

If you dig into the implementation of threads, you will notice that the opaque Thread.t type is actually a heap block (record) of three fields. You can see the Thread.t accessors:

#define Ident(v) Field(v, 0)
#define Start_closure(v) Field(v, 1)
#define Terminated(v) Field(v, 2)

and the Thread.t allocation:

static value caml_thread_new_descriptor(value clos)
{
  CAMLparam1(clos);
  CAMLlocal1(mu);
  value descr;
  /* Create and initialize the termination semaphore */
  mu = caml_threadstatus_new();
  /* Create a descriptor for the new thread */
  descr = caml_alloc_3(0, Val_long(atomic_fetch_add(&thread_next_id, +1)),
                       clos, mu);
  CAMLreturn(descr);
}

The second field, Start_closure, is used to pass the closure to the thread start:

static void * caml_thread_start(void * v)
{
  caml_thread_t th = (caml_thread_t) v;
  int dom_id = th->domain_id;
  value clos;
  void * signal_stack;

  caml_init_domain_self(dom_id);

  st_tls_set(caml_thread_key, th);

  thread_lock_acquire(dom_id);
  restore_runtime_state(th);
  signal_stack = caml_init_signal_stack();

  clos = Start_closure(Active_thread->descr);
  caml_modify(&(Start_closure(Active_thread->descr)), Val_unit);
  caml_callback_exn(clos, Val_unit);
  caml_thread_stop();
  caml_free_signal_stack(signal_stack);
  return 0;
}

and, as seen above, it is overwritten with the unit value before the closure is called.

What this means is that when you call Thread.self () and get a reference to the current Thread.t, the Start_closure field of that heap block will be the unit value:

assert (Obj.field (Obj.repr (Thread.self ())) 1 = Obj.repr ())

Let’s hijack that field for the purpose of implementing an efficient TLS!

Here is the full hack:

module TLS : sig
  type 'a key
  val new_key : (unit -> 'a) -> 'a key
  val get : 'a key -> 'a
  val set : 'a key -> 'a -> unit
end = struct
  type 'a key = { index : int; compute : unit -> 'a }

  let counter = Atomic.make 0
  let unique () = Obj.repr counter

  let new_key compute =
    let index = Atomic.fetch_and_add counter 1 in
    { index; compute }

  type t = { _id : int; mutable tls : Obj.t }

  let ceil_pow_2_minus_1 n =
    let n = n lor (n lsr 1) in
    let n = n lor (n lsr 2) in
    let n = n lor (n lsr 4) in
    let n = n lor (n lsr 8) in
    let n = n lor (n lsr 16) in
    if Sys.int_size > 32 then n lor (n lsr 32) else n

  let[@inline never] grow_tls t before index =
    let new_length = ceil_pow_2_minus_1 (index + 1) in
    let after = Array.make new_length (unique ()) in
    Array.blit before 0 after 0 (Array.length before);
    t.tls <- Obj.repr after;
    after

  let[@inline] get_tls index =
    let t = Obj.magic (Thread.self ()) in
    let tls = t.tls in
    if Obj.is_int tls then grow_tls t [||] index
    else
      let tls = (Obj.magic tls : Obj.t array) in
      if index < Array.length tls then tls else grow_tls t tls index

  let get key =
    let tls = get_tls key.index in
    let value = Array.unsafe_get tls key.index in
    if value != unique () then Obj.magic value
    else
      let value = key.compute () in
      Array.unsafe_set tls key.index (Obj.repr (Sys.opaque_identity value));
      value

  let set key value =
    let tls = get_tls key.index in
    Array.unsafe_set tls key.index (Obj.repr (Sys.opaque_identity value))
end

The above achieves about 80% of the performance of Domain.DLS allowing roughly 241M TLS.gets/s (vs 296M Domain.DLS.gets/s) on my laptop.

9 Likes

This is very interesting. You’re saying there’s field in Thread.t that is unused after the thread has started, and we could use it to stash an array for TLS purposes.

Can this be proposed upstream? The lack of TLS in the stdlib is a big mystery to me. @ELLIOTTCABLE has implemented ambient-context which provides an implementation (and I have, too, in local things) but nothing remotely as efficient as what you propose.

1 Like

Unless I am missing something, it should not be very hard to implement TLS in the threads implementation – add a field to Caml_state that is saved in caml_thread_struct. If there is motivation to do it, why not do this instead of a hack?

This sounds very nice, I could make use of a fast TLS implementation. I would say that the interest is that current OCaml versions already support it. While waiting for upstream to get in motion, it would be nice to have your implementation already on opam. Otherwise a risk is that everyone start re-using this field in an uncoordinated fashion.

For Domain.DLS not being thread-safe, it might be worth opening an issue to see to which degree this could be fixed.

1 Like

Just a note that for those using Eio, there is fiber-local storage: Fiber (eio.Eio.Fiber)

I am not intending to criticize your remark or hack (which I agree is a hack, in a non-derogative way), but rather to suggest that it is probably easier and less work for everyone to have someone interested in TLS propose something upstream with a clean implementation (not a hack). It may require making a case that TLS is worth adding to the runtime, but I don’t perceive this as a very difficult argument to make.

The vibe I get from your answers is that you are not yourself interested in doing this. Maybe we could invite @c-cube to try that (interested in a week-end of C runtime hacking?), or anyone of the people currently contributing to the runtime.

2 Likes

Having an implementation already on opam and seeing it be used, especially when it relies on a hack, would be the best argument for the lengthy and disheartening labor of trying to have something upstream a couple of years later.

Last time I looked at Domain.TLS, the inerface was very bad.
If this is a key-value store, then it should have a hashtable kind of interface.
From my current understanding, it is indeed a hash table and each domain has a different view of what’s inside.

Simple interfaces, with well named methods are very important.
Well named types are also very important.
If you need a lot of ocamldoc to tell what the thing is doing, you are doing a bad job at designing your thing.

Domain.DLS does present a hash table interface though. The functions are new_key, get, and set. That seems pretty simple to me.

The proper interface of a hash_table is: [create n], [add k v], [find k].
By the name, I have no idea of what [Domain.DLS.new_key] is supposed to do.

But, since the parallelization performance of fork-based OCaml programs is superior
to the parallelization performance of ocaml-5 thread-based programs, I don’t care anymore
about what’s in the Domain module to be honest.

I have never used the Domain.DLS module, so I have not looked into it, but I imagine Domain.DLS.new_key, Domain.DLS.set and Domain.DLS.get are intended to be analogous to pthread_key_create, pthread_setspecific and pthread_getspecific. If that is the case it seems a reasonable choice of names. At any rate, that is where I would start looking were I to want to use domain-local storage in OCaml.

Are you referring to that one specific microbenchmark you did comparing them? I don’t think that result can be generalized to a range of production workloads especially if we are adding concurrent I/O into the mix.

5 Likes

We considered mirroring the DLS interface in systhreads at some point during the development of Multicore. I had an implementation but it was lost due to lack of time and possibly indecisiveness around the possible interactions between DLS and the proposed TLS. (I can’t remember the finer details.)

I do not think it is a hard argument to make either, I do not think I have time to work on it, I would however be glad to review any implementation and discuss the finer details.

I am definitely not a huge fan of code running in the wild exploiting runtime internals. The simpler QA checks are for OCaml releases with a well behaved ecosystem, the better. :slightly_smiling_face:

6 Likes

I’m sorry to say that it’s too late now, the cat is out of the bag: GitHub - c-cube/thread-local-storage: thread-local storage for OCaml (just released 0.1 :slightly_smiling_face:). The only fix for that is to have a similar, but clean, TLS implementation in the stdlib.

I’m a bit sad that it was considered a few years ago and did not happen, though. If anything it’s DLS that, imho, seems less important.

1 Like