[ANN] T-Digest library

Github link
This is just a minor release of a pandemic project that I never announced on discuss.ocaml.org.
This library is “Complete”. There are no known bugs and no known missing features.

The T-Digest has become fairly well known in the last few years, but in short:

  • it’s a lossy data structure that allows the user to (very) accurately approximate percentiles and p_ranks without having to keep the entire sorted dataset in one place.
  • the user can combine multiple T-Digests just by concatenating them, and this can be done in the database itself!
  • both querying and insertion are blazing fast

facebook/infer has been using it for a few years, and I know it’s also used in a few closed-source projects elsewhere.

All comments and feedback is welcome! I hope this library proves useful to the OCaml ecosystem as a whole.


Interesting! Could you elaborate/point to how a static analyzer is benefiting from such data structure? It’s not obvious at all.

I’m glad you’re asking!
This commit gives a solid clue. It appears that they run Infer on large codebases and that the static checks are performed by a cluster of servers. Each server collects a number of performance metrics about the modules they analyze. The servers then accumulate the metrics using the monoidal nature of the T-Digest, MapReduce-style.

1 Like

Cool stuff! I wondered a bit about whether this could be of use for the OCaml compiler or runtime. After discussing it with @octachron, a first idea would be to use it to store the distribution of GC latencies. (This could be implemented as a Runtime_events consumer, instead of being in the compiler codebase.)

(We already store some distribution information on allocation sizes, but for this there is no need for a sophisticated approach as a histogram with fixed bins does a fine job in practice.)

This is really exciting to hear. At the moment the library uses Core for its advanced Map functions, but I’ll look into reimplementing those functions into tdigest itself to make it dependency-free and a viable choice for the compiler. I’ll ping you in this thread once it’s done.

Note that for an implementation using a runtime events consumer, there is no need to strip dependencies since those can live in a separate analyzer that subscribe to runtime events of other OCaml processes.

Thanks for the info! That is good to know.

An update on removing the dependency on Core:
I spent some time last weekend and ended up with a working version without a dependency on Core or Base. I did that in part by recreating part of the Stdlib.Map module and adding (porting) Core.Map.binary_search to it. It was all pretty straightforward, but unfortunately the performance dropped by about 11x. I deemed that to be too much.
Part of why it dropped so much is due to the data representation of Map, but also because Core.Hashtbl offers functions such as update that have to be replaced by a wasteful 2-step find+replace when using Stdlib.Hashtbl.

After testing various other ideas (including porting Core.Hashtbl.update and others), I decided to keep Core, but to also offer Tdigest.Marshallable.t. Identical in features, but internally it uses Core.Map.Make_tree as opposed to Core.Map.Make used by Tdigest.t.

Performance of Tdigest.Marshallable.t drops by 2x compared to Tdigest.t, but it’s still faster than the standard library’s Map and it can be marshalled as the name suggests.
Fast serialization using to_string/of_string and sexp_of_t/t_of_sexp is possible with either version.


Stdlib’s Map has val update : key -> ('a option -> 'a option) -> 'a t -> 'a t. Isn’t that what you need?

I’m sorry, I meant Hashtbl.update. Everything else in my post still applies to Map. I’ve edited it. Good catch!

11x is a big difference, it would be interesting if you could post the code for the two versions somewhere. It raises the question of whether the optimization to add a distinct constructor for Leaf nodes is really that effective, or if the difference comes from something else.