[ANN] patricia-tree 0.9.0 - library for patricia tree based maps and sets

I’m happy to announce the release of a new patricia-tree library, version 0.9.0 on opam.

This library that implements sets and maps as Patricia Trees, as described in Okasaki and Gill’s 1998 paper Fast mergeable integer maps. It is a space-efficient prefix trie over the big-endian representation of the key’s integer identifier.

For full details, visit see the documentation or the source on github.

Features

  • Similar to OCaml’s Map and Set, using the same function names when possible and the same convention for order of arguments. This should allow switching to and from Patricia Tree with minimal effort.
  • The functor parameters (KEY module) requires an injective to_int : t -> int function instead of a compare function. to_int should be fast, injective, and only return positive integers. This works well with hash-consed types.
  • The Patricia Tree representation is stable, contrary to maps, inserting nodes in any order will return the same shape. This allows different versions of a map to share more subtrees in memory, and the operations over two maps to benefit from this sharing. The functions in this library attempt to maximally preserve sharing and benefit from sharing, allowing very important improvements in complexity and running time when combining maps or sets is a frequent operation.
  • Since our Patricia Tree use big-endian order on keys, the maps and sets are sorted in increasing order of keys. We only support positive integer keys. This also avoids a bug in Okasaki’s paper discussed in QuickChecking Patricia Trees by Jan Mitgaard.
  • Supports generic maps and sets: a 'm map that maps 'k key to ('k, 'm) value. This is especially useful when using GADTs for the type of keys. This is also sometimes called a dependent map.
  • Allows easy and fast operations across different types of maps and set (e.g. an intersection between a map and a set), since all sets and maps, no matter their key type, are really positive integer sets or maps.
  • Multiple choices for internal representation (NODE), which allows for efficient storage (no need to store a value for sets), or using weak nodes only (values removed from the tree if no other pointer to it exists). This system can also be extended to store size information in nodes if needed.
  • Exposes a common interface (view) to allow users to write their own pattern
    matching on the tree structure without depending on the NODE being used.

Comparison to other OCaml libraries

ptmap and ptset

There are other implementations of Patricia Tree in OCaml, namely ptmap and ptset. These are smaller and closer to OCaml’s built-in Map and Set, however:

  • Our library allows using any type key that comes with an injective to_int function, instead of requiring key = int.
  • We support generic (heterogeneous) types for keys/elements.
  • We support operations between sets and maps of different types.
  • We use a big-endian representation, allowing easy access to min/max elements of maps and trees.
  • Our interface and implementation tries to maximize the sharing between different
    versions of the tree, and to benefit from this memory sharing. Theirs do not.
  • These libraries work with older version of OCaml (>= 4.05 I believe), whereas
    ours requires OCaml >= 4.14
  • Our keys are limited to positive integers.

dmap

Additionally, there is a dependent map library: dmap. It allows creating type safe dependent maps similar to our heterogeneous maps. However, its maps aren’t Patricia trees. They are binary trees build using a (polymorphic) comparison function, similarly to the maps of the standard library. Another difference is that the type of values in the map is independent of the type of the keys, allowing keys to be associated with different values in different maps. i.e. we map 'a key to any ('a, 'b) value type, whereas dmap only maps 'a key to 'a.

dmap also works with OCaml >= 4.12, whereas we require OCaml >= 4.14.

13 Likes

This library, developped by Dorian and me, was extracted from the Codex static analyzer, which was recently open sourced (and in a phase where we are writing documentation before doing a public announcement). For this reason, this library is particularly useful to implement efficient environments in an abstract interpreter (the performance of the merge is in O(log(n)*d), where n is the number of bindings and d the number of mappings that differ between the two maps, instead of the O(n) cost of the OCaml standard Map based on AVL trees.

2 Likes

Last time I was into integer keys, I got into feeling that we should deeply study a paper “Comparing Integer Data Structures for 32 and 64-bit Keys”. Authors have a benchmarks suite and a paper PDF in Github.

Also I found over the internet the implementations where guys are using “fat” nodes and pop count to select a right child to continue lookup (in Haskell).

Could you position your implementation with approaches from the paper and Haskell ones?

Thanks a lot @dlesbre and @Matthieu_Lemerre for releasing this library!

I have a question for you: are your sets and maps themselves hash-consed? If not, do you have plans to implement this?
(This would allow to have a constant-time equality test, save memory, and enable memoization for functions that manipulates such maps and sets!)

Minor correction when comparing with dmap (which I’m the author of).
It is correct that dmap does not implement maps from keys of type 'a key to values of type ('a, 'b) value.
However, dmap implements maps from keys of type 'a key to values of type 'a value, where 'a value can be chosen by the client (it is the argument of a functor). See the module MakeWithValue.
This is slightly less general than what your library offers, but this remains quite flexible already.

1 Like

Not yet, but with a custom NODE that should be fairly easy to do. We already have a NodeWithId which adds unique identifier to all newly generated nodes, all we need is a hashtable lookup to check if a similar node already exists.

This is a good point, we’ll include it in the next release. Cf. Draft: hashconsed maps nodes by dlesbre · Pull Request #1 · codex-semantics-library/patricia-tree · GitHub

1 Like

Nicely spotted, sorry about that, I’ll correct it.

Edit: It seems I can’t actually edit this post, but I’ve fixed it on github and in the package documentation.

Thanks for the references! The first paper does not mention Patricia Tries, so does not compare to it. Also, the comparison is done for C++, where allocation is inefficient and in-place mutation is fast, so probably the results they have would be quite different on an OCaml workload. That being said, I think the “burst” approach where we replace leaves by a small immutable array of leaves could be implemented on top of Patricia Tries – maybe with some small changes to the interface. It is probably also possible to group several interior nodes in a contiguous data structure. There are drawbacks though, notably there would be less cases where the shortcut in the merge function “if the subtrees are equal, then return the subtree as-is” would work.

Regarding HAMT, I remember that I did once replaced Patricia Tries with HAMT (I think it was this implementation: ocaml-hamt/src at master · recoules/ocaml-hamt · GitHub ) which resulted in slower benchmarks. However, Array-mapped Tries (no need for hashing in our case) are also well-suited to fast merging, so maybe a fine-tuned implementation could be faster. Note however that AMT work well if you have a dense map; normal AMT do not have compression of the key, so e.g. a singleton map from 0b1 00000 00000 requires three interior nodes if every interior node can hold up to 32 keys. Patricia trees do not have this problem (which would also be solved by the burst approach).

So, this work implements just Patricia Tries – but an optimized version of it. Comparing with other data structures would sure be very interesting – what we know is that this data structure is much more efficient that Map (based on AVL trees) for merge operation on large maps that originate from a common ancestor, a common situation in Abstract Interpretation.