The Str module in OCaml is known to be not thread safe (it relies on global state to store/retrieve match offsets: those APIs only get unit as a parameter and not the regex). Re itself doesn’t have this problem, as long as you avoid sharing a regex across threads.
It provides a neat Re.Str to be able to run regular expressions using theStr syntax. The documentation for Re.Str doesn’t mention anything about thread safety issues. So then we should be able to switch our code from using Str to using Re.Str and it’d be safe, right? Or at least if we limit ourselves to APIs that do not rely on global state?
Looking at the implementation it uses a single state global that is not protected by any mutexes, so even APIs like Re.Str.split or Re.Str.global_substitute are not safe to use in multi-threaded applications: the (temporary) values from one Re.Str match on one thread could overwrite the values used by another regex on another thread.
I think this is because the Str API is fundamentally not thread-safe, however it should be possible to support the Strsyntax without providing 1:1 compatibility with the Str API (otherwise what would be the benefit of switching from Str to Re.Str if not to address thread-safety?)
It’d be good to:
add some warnings to Re.Str API that it is not thread safe
add some warnings to OCaml’s own Str that it is not thread-safe (even if that is obvious)
deprecate Re.Str in favour of another Re.SafeStr that uses a slightly different API that would be thread-safe
document the thread-safety of Re itself: using the same (compiled) regex from multiple threads doesn’t seem to be safe since the type contains some mutable state. Using different regexes from different threads should be safe though, so one would need mutexes only around single regexes, and not globally around all regexes.
I’ll try to find some time to create PRs for these, but thought to let others know here.
(Note that this is unrelated to multicore: this is already not safe with regular OCaml threads)
If I remember correctly, this mutable state is just a cache. So, using the same regex should be thread-safe, in the sense that the worst that can happen is that an already computed result needs to be recomputed because a race condition caused it to be dropped from the cache.
Maybe there is a misconception here? I assumed you can compile a regexp to internal form and use that for matching in multiple cases. But that does not work if this internal form carries state. Instead, you would need to compile the regexp every time you want to start matching a string such that this match operation has its internal state. This is not obvious from the API.
Are you sure? I see a field called desc of type Automate.State.t that has a status field with element Match of mark_infos * Pmark.Set.t | Running.
I understand the need to compute/cache the DFA on the fly (and that this should be equivalent even in the case of races), but this seems to be match state, not automata state.
But that would mean that Re is broken even in the single-threaded case, as you would not be able to use the same regex to match two different strings concurrently, e.g, using Seq.split re foo twice.
I don’t think that it is necessarily broken, just that the types used don’t obviously guarantee its correctness. Or maybe I’m just misinterpreting it, and the match state (running/failed/marks with an int array) isn’t actually mutable, it is just that an array was the easiest way to store a list of numbers, and match state refers to where you are in the automata, i.e. a final state or not.
A quick test where I use a regex with groups matching different strings in different positions with 3 threads doesn’t show any inconsistencies, but that doesn’t mean it is correct, more fuzzing would be needed.
However I do see a Hashtbl (Automata.State.Table.t), part of the states field in type re, and even if it is meant to be idempotent I don’t think that modifying a Hashtbl (without any locks) from multiple threads concurrently would be safe: you might temporarily see some inconsistencies (e.g. during resizing a Hashtbl?).
Also none of the unit tests in Re use threads, which makes me think that thread safety may not have been considered during design (the documentation doesn’t say either way).
I quite like Re from an API point of view though, and these internal thread safety issues (if any) might be fixable, e.g. even if Hash tables are faster, using a ref to a Map might be safer when multiple threads are a possibility, and the arrays should probably be wrapped in a such a way to make it obvious they are read-only.
Are you talking about multicore threads or regular threads? Regular threads will not tell you much about the thread safety of Re, since the OCaml runtime guarantees that only one thread will ever execute OCaml code at a given time. And since new threads are scheduled for execution on system calls only (as a rule of thumb), the code from Re will always run to completion without being interrupted.
In the multicore version of ocaml, the Str module is reputed to have been made thread safe, after I raised the issue with the maintainers some six or more months ago, although I have not tested it. I believe it now uses domain-local state so each domain/thread has its own version. Notwithstanding the fact that it is (I believe) now thread safe for ocaml-5, Str still of course has a horrible interface, but that is another matter.
I have never used Re’s emulation of Str, but if it is not thread safe it could be made so in the same way by using thread-local state.
I can confirm that the global mutable state in trunk Str is domain-local, so the same ugly imperative API will work without issues in a multi-domain scenario. Of course, if you use Threads (or fibers, etc.), you may still be in trouble if there is a yield in-between a search and the access to the matched groups.
I think that is somewhat misleading. As I understand it, it is “thread-safe” in so far as two different domains accessing Str will not interfere with one another, as each will see its own version of the global data. However if you use the old Thread interface within any one domain (why do it?), and the thread in question makes a system call which yields to another in the same domain before the global data has been accessed by it, it might see the wrong value if and when it resumes, which may or may not cause your program to “crash and burn”. If you are using the Thread module, you can I think avoid encountering a wrong value by ensuring that the application of, say, Str.matched_string occurs immediately after the function call (say, Str.search_forward) which sets the global value.
That is about right. And the important point is that the user code has complete control. The same way the code should never call a Str-using function between some calls to Str.search_forward and Str.matched_string, it should never call a function that might yield to another fiber/thread. So, in the end, it is a matter of documentation and code discipline.
Do users have control over when thread switching happens? I thought that when using the native runtime it can happen at any point OCaml checks for signals, i.e. around the places where the garbage collector can run. The only way to ensure that OCaml doesn’t switch threads (talking pre-multicore here) is to avoid allocation and avoid calling syscalls. IIUC those are the principles behind Core’s Nano_mutex, but it is a difficult property to ensure about larger pieces of code.
There is this code in the runtime ocaml/st_posix.h at 4.12 · ocaml/ocaml · GitHub that runs every ~50ms once you have more than one thread that delivers a fake signal to itself, which tweaks the minor heap such that it enters the GC early, and i think that will in the end cause OCaml to switch to another thread.
I agree the API and behavior of Str are not thread-safe. I was just pointing out that the code was changed to ensure that the state is domain-local, so that the problem is no worse on 5.0 than before.
I think that the point that you can just reason about when yield occurs is only semi-valid: it is true in entirely cooperative libraries (@edwin is pointing out that Thread isn’t), but then in practice most user code does not care about where yield occur and makes it difficult to reason about them.
If we want robustness, we should use explicit locking around the resource, but I think it’s much better to move to an API without global state. Especially in this case where it’s not very difficult.
So just to clarify, it’s domain (system-thread)-safe, but not fiber (green thread)-safe, since fibers can switch between invocations if one is not careful.