Concestor data structures

Several years ago I spent a lot of time figuring out how to wrap imperative data structures with undo buffers in order to combine the elegance of purity with the performance of mutation. You may have heard of functional arrays, aka diff arrays, which use an equivalent approach in the simpler case of a fixed-size array. My concestor collections extend this idea to Sets and Maps. The word concestor means nearest common ancestor, referring to the spaghetti stack of diffs between different versions of a persistent data structure. Replacing trees with hash tables results in dramatic speedups in almost all practical applications, in my experience.
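
For anyone who hasn't met diff arrays, here is a minimal sketch of the idea in OCaml (the names are illustrative, not the code I uploaded): the newest version owns the real mutable array, older versions hold chains of undo diffs, and touching an old version "reroots" the array by reversing those diffs.

```ocaml
(* Minimal diff-array sketch: the newest version holds the real array;
   older versions are chains of (index, value) diffs pointing at it. *)
type 'a t = 'a node ref
and 'a node =
  | Arr of 'a array            (* the live, mutable backing store *)
  | Diff of int * 'a * 'a t    (* this version holds [v] at [i];
                                  otherwise it is the next version *)

let create n x : 'a t = ref (Arr (Array.make n x))

(* Make [t] the version that owns the array, reversing diffs en route. *)
let rec reroot (t : 'a t) =
  match !t with
  | Arr _ -> ()
  | Diff (i, v, t') ->
    reroot t';
    (match !t' with
     | Arr a ->
       let v' = a.(i) in
       a.(i) <- v;
       t := Arr a;
       t' := Diff (i, v', t)
     | Diff _ -> assert false)

let get (t : 'a t) i =
  reroot t;
  match !t with
  | Arr a -> a.(i)
  | Diff _ -> assert false

(* Persistent update: mutate in place, return the new version, and turn
   [t] into a diff that can undo the write. *)
let set (t : 'a t) i v : 'a t =
  reroot t;
  match !t with
  | Arr a ->
    let old = a.(i) in
    a.(i) <- v;
    let t' = ref (Arr a) in
    t := Diff (i, old, t');
    t'
  | Diff _ -> assert false
```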

Prompted by comments here I just ported this to OCaml. I have uploaded my simple but efficient open-addressed hash tables and a concestor Map derived from them. My OCaml-fu isn’t up to much these days so I’m sure you can improve it.

I’ve also run some benchmarks with good results. Populating int->int mappings from a 31-bit PRNG, I find it takes 2.6s with OCaml’s Hashtbl, 2.6s with my hash table, 3.2s with my concestor Map and 12.3s with OCaml’s Map. YMMV.
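
A harness of roughly this shape reproduces the two stdlib measurements (the element count, and Random.bits standing in for the 31-bit PRNG, are illustrative, not my exact harness):

```ocaml
(* Rough shape of the benchmark: insert n random int keys and time it. *)
module IntMap = Map.Make (Int)

let time name f =
  let t0 = Sys.time () in
  ignore (f ());
  Printf.printf "%s: %.1fs\n" name (Sys.time () -. t0)

let () =
  let n = 10_000_000 in
  time "Hashtbl" (fun () ->
    let tbl = Hashtbl.create 16 in
    for i = 1 to n do Hashtbl.replace tbl (Random.bits ()) i done;
    tbl);
  time "Map" (fun () ->
    let m = ref IntMap.empty in
    for i = 1 to n do m := IntMap.add (Random.bits ()) i !m done;
    !m)
```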

The main caveat with concestor collections is diverging lineages. They work well provided you are primarily operating on the most recent version of a collection and only occasionally reverting to an old version. If the workload involves many switches between different versions of a collection, you may want to clone the collection so that the different lineages are backed by different hash tables (see the sketch below).
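
In terms of the simplified diff-array sketch above (arrays standing in for the hash tables), cloning just means materialising a version into its own fresh backing store, something like:

```ocaml
(* Replay a version's diffs onto a private copy of the backing array. *)
let rec to_array (t : 'a t) : 'a array =
  match !t with
  | Arr a -> Array.copy a
  | Diff (i, v, t') ->
    let a = to_array t' in
    a.(i) <- v;          (* overlay this version's value *)
    a

(* Split [t]'s lineage off onto its own backing store. *)
let clone (t : 'a t) : 'a t = ref (Arr (to_array t))
```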

As further work I’d like to see how far this idea can be taken and, in particular, what impact it has on garbage collection. The concestor here uses a simple purely functional singly-linked list of changes between versions. That could be bootstrapped from a pure list that is itself a concestor collection backed by an extensible array. Taken to the extreme, I believe a language relying heavily upon purely functional data structures could be relatively fast with only a simple mark-sweep GC, not a generational one. Could the compiler even apply the transformation to a concestor itself to some extent? Data structures equivalent to lists are common: could they be automatically backed by extensible arrays? Finally, what can be done about integrating these collections with the garbage collector itself? The mutators wouldn’t need to zero out removed data: the GC would understand the collection well enough not to follow those pointers.


I only followed some of your code. Would it be possible to change the default behavior so that, upon trying to read or modify an earlier version, the entire hash table is immediately duplicated? To me, that’s the most interesting/useful behavior. In practical code we almost never want to share different versions of data structures, and yet functional programming gives us this (mostly unused) capability everywhere, and we pay for it in performance. But when we do share, we want that sharing to be useful, i.e. to allow full modification of the shared item afterwards, and I think copying the entire data structure is ultimately more efficient for this purpose.

Of course, there’s also the matter of GC. We have to keep everything around for the entire lifetime of the data structure - even things that were deleted - since we don’t know if sharing took place at some point. A gc function is probably needed for long-lived structures.
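
Concretely, sketched on the diff-array code from the first post and reusing its clone (hypothetical, not a patch against your code), the default I have in mind is:

```ocaml
(* Proposed policy: accessing any version that is not the newest
   duplicates the backing store instead of rerooting it. *)
let get_splitting (t : 'a t) i =
  (match !t with
   | Arr _ -> ()                   (* newest version: read in place *)
   | Diff _ -> t := !(clone t));   (* old version: give it its own store *)
  match !t with
  | Arr a -> a.(i)
  | Diff _ -> assert false
```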

Note in passing: people interested in this topic may want to have a look at Semi-persistent data structures by Sylvain Conchon and Jean-Christophe Filliâtre, 2008. It describes a middle point between “transient/mutable” and “persistent/immutable” where you can go back to old versions of the data structure at any point, but using an old version invalidates all later ones. This restricted form of persistence suffices for backtracking search and other use cases, and can be implemented more efficiently than full persistence.


You could but I’m not sure you’d want to.

Once upon a time I experimented with counting how many times each diff had been reversed. My plan was to clone when the number of diffs reached a quota relative to the size of the collection, but I couldn’t find any practical applications that behaved that way, so I binned the idea.

I think there are two common applications:

  1. Linear: use a single lineage so there is no difference between mutable and immutable.
  2. Backtracking: exhaust the current lineage and revert to an older version for further development, never using the previous lineage again.

Few applications do anything else, e.g. lexing with Antimirov derivatives uses many lineages simultaneously.
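
In terms of the earlier diff-array sketch, the two common patterns look something like this:

```ocaml
(* 1. Linear: only the newest version is ever touched, so no rerooting. *)
let linear () =
  let v0 = create 4 0 in
  let v1 = set v0 0 10 in
  let v2 = set v1 1 20 in
  get v2 1                       (* = 20 *)

(* 2. Backtracking: abandon a lineage and continue from an old version;
   the abandoned diffs become garbage. *)
let backtrack () =
  let v0 = create 4 0 in
  let v1 = set v0 0 10 in
  ignore (get v1 0);             (* explore... *)
  let v1' = set v0 0 99 in       (* ...revert to v0 and branch afresh *)
  get v1' 0                      (* = 99 *)
```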

A key feature of this approach is that only the diffs since the concestor of all live versions are live (hence the name “concestor”). Everything else is dead and can be collected by the GC. This is a great trade-off:

  1. Optimal for linear use in the presence of a generational GC because all diffs are collected by the nursery.
  2. Optimal for backtracking because only chains of diffs directly back to remembered versions are kept.

Once upon a time I tried using change sets to compact long chains of diffs, but the performance was significantly worse in the common case because the compactions take time.

I tried many other ideas, including reference counting, weak references, and semi-weak chains of diffs. None of them came close to this design in the general case.

Would be interesting to drop this in as a replacement for Map in the compiler and see how it fares…

I’m sure it’s occurred to you to do the most obvious hack of making a “map” be a pair of a hash table and a reference to an avltree (or whatever)? Viz. ('a, 'b) Hashtbl.t * ('a, 'b) avltree ref? Then you can “freeze” the map by just taking the avltree pointer. Inserts/deletes go to both; lookups that hit in the hashtbl don’t need to go to the avltree. This means misses are more costly, but for many applications misses are much less common than hits. And of course, with some mods, you can make misses as cheap as hits.

The above (of course) is aimed at applications where reads vastly outnumber updates.
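
To make the shape concrete, here is a minimal sketch using the stdlib Map (itself a balanced tree) as the persistent half; all names are illustrative:

```ocaml
(* Hash table for fast current-version lookups, persistent Map for O(1)
   freezing. Inserts and deletes go to both halves. *)
module Hybrid (K : Map.OrderedType) = struct
  module M = Map.Make (K)

  type 'v t = { tbl : (K.t, 'v) Hashtbl.t; mutable map : 'v M.t }

  let create () = { tbl = Hashtbl.create 16; map = M.empty }

  let add t k v =
    Hashtbl.replace t.tbl k v;
    t.map <- M.add k v t.map

  let remove t k =
    Hashtbl.remove t.tbl k;
    t.map <- M.remove k t.map

  (* Hits are answered by the hash table alone; a miss falls through to
     the tree, which is the "misses are more costly" case above. *)
  let find t k =
    match Hashtbl.find_opt t.tbl k with
    | Some v -> Some v
    | None -> M.find_opt k t.map

  (* "Freeze": the Map value is already persistent, so this is O(1). *)
  let freeze t = t.map
end
```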

That hadn’t occurred to me at all! 🙂

Not sure I entirely understand how that would work efficiently. Could you pull some code together so we can study the performance characteristics?