This is just a minor release of a pandemic project that I never announced on discuss.ocaml.org.
This library is “Complete”. There are no known bugs and no known missing features.
The T-Digest has become fairly well known in the last few years, but in short:
- it’s a lossy data structure that lets the user approximate percentiles and percentile ranks (p_ranks) very accurately without having to keep the entire sorted dataset in one place.
- the user can combine multiple T-Digests just by concatenating them, and this can be done in the database itself!
- both querying and insertion are blazing fast
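To make the idea concrete, here is a deliberately simplified sketch of a digest as a sorted list of `(mean, weight)` centroids. This is a hypothetical illustration, not the `tdigest` library’s API, and a real T-Digest also compresses nearby centroids to bound memory:

```ocaml
(* Hypothetical sketch of the t-digest idea. A digest is a list of
   centroids kept sorted by mean; every name below is illustrative. *)
type centroid = { mean : float; weight : float }
type digest = centroid list

let by_mean a b = compare a.mean b.mean

(* Insert one observation as a unit-weight centroid. *)
let add x d = List.sort by_mean ({ mean = x; weight = 1.0 } :: d)

(* Merging two digests is the monoid operation mentioned above:
   combine the centroid lists and restore the sort order. *)
let merge a b = List.sort by_mean (a @ b)

(* Approximate the q-th percentile (q in [0, 1]) by walking the
   cumulative weight until the target mass is reached. *)
let percentile q d =
  let total = List.fold_left (fun acc c -> acc +. c.weight) 0.0 d in
  if total = 0.0 then None
  else
    let target = q *. total in
    let rec go acc = function
      | [] -> None
      | [ last ] -> Some last.mean
      | c :: rest ->
          let acc = acc +. c.weight in
          if acc >= target then Some c.mean else go acc rest
    in
    go 0.0 d
```

Because `merge` is associative with the empty digest as identity, digests produced on different machines (or stored in a database) can be folded together in any order.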
facebook/infer has been using it for a few years, and I know it’s also used in a few closed-source projects elsewhere.
All comments and feedback are welcome! I hope this library proves useful to the OCaml ecosystem as a whole.
Interesting! Could you elaborate on, or point to, how a static analyzer benefits from such a data structure? It’s not obvious at all.
I’m glad you’re asking!
This commit gives a solid clue. It appears that they run Infer on large codebases and that the static checks are performed by a cluster of servers. Each server collects a number of performance metrics about the modules they analyze. The servers then accumulate the metrics using the monoidal nature of the T-Digest, MapReduce-style.
Cool stuff! I wondered a bit about whether this could be of use for the OCaml compiler or runtime. After discussing it with @octachron, a first idea would be to use it to store the distribution of GC latencies. (This could be implemented as a Runtime_events consumer, instead of being in the compiler codebase.)
(We already store some distribution information on allocation sizes, but for this there is no need for a sophisticated approach as a histogram with fixed bins does a fine job in practice.)
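For contrast with the digest approach, a fixed-bin histogram of the kind described really is only a few lines. This is a hypothetical sketch (the power-of-two bin scheme is my assumption, not necessarily what the runtime uses):

```ocaml
(* Hypothetical fixed-bin histogram for allocation sizes: bin i counts
   allocations of size in [2^i, 2^(i+1)), clamped to the last bin.
   Fixed bins suffice when range and resolution are known up front. *)
let bins = Array.make 16 0

let record size =
  (* index of the highest set bit of [size] *)
  let rec log2 n acc = if n <= 1 then acc else log2 (n lsr 1) (acc + 1) in
  let i = min (Array.length bins - 1) (log2 size 0) in
  bins.(i) <- bins.(i) + 1
```

Unlike a digest, this cannot answer arbitrary percentile queries precisely, but for coarse allocation-size distributions that trade-off is fine in practice.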
This is really exciting to hear. At the moment the library uses Core for its advanced Map functions, but I’ll look into reimplementing those functions in `tdigest` itself to make it dependency-free and a viable choice for the compiler. I’ll ping you in this thread once it’s done.
Note that for an implementation using a runtime events consumer, there is no need to strip dependencies, since those can live in a separate analyzer that subscribes to the runtime events of other OCaml processes.
Thanks for the info! That is good to know.
An update on removing the dependency on Core:
I spent some time last weekend and ended up with a working version without a dependency on Core or Base. I did that in part by recreating part of the `Stdlib.Map` module and adding (porting) `Core.Map.binary_search` to it. It was all pretty straightforward, but unfortunately the performance dropped by about 11x. I deemed that to be too much.
Part of why it dropped so much is due to the data representation of `Map`, but also because `Core.Hashtbl` offers functions such as `update` that have to be replaced by a wasteful two-step `replace` when using the standard library’s `Hashtbl`.
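The two-step pattern in question looks roughly like this over the standard library’s `Hashtbl` (a sketch; `Core.Hashtbl.update` performs the equivalent in a single call):

```ocaml
(* Emulating Core.Hashtbl.update over Stdlib.Hashtbl: the lookup and
   the store are two separate traversals of the table, which is the
   wasteful two-step replace being described. *)
let update tbl key f =
  Hashtbl.replace tbl key (f (Hashtbl.find_opt tbl key))

(* Example: counting occurrences with insert-or-increment semantics. *)
let incr_count tbl key =
  update tbl key (function None -> 1 | Some n -> n + 1)
```

With Core’s `update`, the table is traversed once; the emulation above pays for two hash lookups per call, which adds up on hot insertion paths.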
After testing various other ideas (including porting `Core.Hashtbl.update` and others), I decided to keep Core, but to also offer `Tdigest.Marshallable.t`: identical in features, but internally it uses `Core.Map.Make_tree` as opposed to the `Core.Map.Make` used by `Tdigest.t`. Performance of `Tdigest.Marshallable.t` drops by 2x compared to `Tdigest.t`, but it’s still faster than the standard library’s `Map`, and it can be marshalled, as the name suggests.
Fast serialization using `t_of_sexp` is possible with either version.
Stdlib’s Map has `val update : key -> ('a option -> 'a option) -> 'a t -> 'a t`. Isn’t that what you need?
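For reference, that `Stdlib.Map` signature works like this (a small self-contained example):

```ocaml
module IntMap = Map.Make (Int)

(* update passes the current binding (or None) to the callback and
   stores whatever it returns; returning None removes the key. *)
let m = IntMap.(empty |> add 1 10)
let m = IntMap.update 1 (function None -> Some 1 | Some n -> Some (n + 1)) m
```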
I’m sorry, I meant `Hashtbl.update`. Everything else in my post still applies to `Map`. I’ve edited it. Good catch!
11x is a big difference; it would be interesting if you could post the code for the two versions somewhere. It raises the question of whether the optimization of adding a distinct constructor for Leaf nodes is really that effective, or whether the difference comes from something else.