As a learning exercise while studying 2-3 finger trees, I’ve implemented the data structure in a handful of different languages. It turns out that the OCaml version is (thus far) the most readable while still retaining a high degree of type safety and efficiency while staying within the standard idioms. (Moreover, I didn’t even have to use any non-standard extensions–I ended up using type families in Haskell in a misguided attempt to avoid multi-parameter type classes.)
My main requirements in the design are to:
- Minimize repetition. I don’t want to have to copy-paste large swaths of code to adapt to special cases, handle the leaf versus node type specially, etc.
- Minimize memory bloat by avoiding things like storing a measurement function, etc., in every (sub)tree.
- Maintain as much type safety as possible to prevent misuse and bugs during implementation. For example, it should be impossible to attempt to concatenate two finger trees with different internal measure functions, even if they use the same measurement type representations.
I went through several designs and ended on a functor-based implementation that essentially uses an associated type for the monoidal measure (along with a measure function on values) to represent the structural requirements on elements. Internally, the nodes are represented using Peano-style GADTs (where the original Haskell paper uses first-class polymorphic types). As I see it, these decisions meet the above requirements in a very satisfying way. For example, the use of a coupled measure
type ensures that measurements can’t be mixed and matched unintentionally, while the functor application allows static dispatch of the measurement function itself (at least in theory–I’m not sure how OCaml compiles this under the hood). Similarly, the GADT representation could in principle compile inner node types to untagged variants under the hood with an optimizing compiler (again, not sure how realistic this is today).
Here’s the module interface:
module type Measurable = sig
type t
type measure
val zero : measure
val ( + ) : measure -> measure -> measure
val measure : t -> measure
end
module type S = sig
type elem
type measure
type t
type split = t * t
val empty : t
val is_empty : t -> bool
val prepend : elem -> t -> t
val append : elem -> t -> t
val pop_left : t -> (elem * t) option
val pop_right : t -> (elem * t) option
val concat : t -> t -> t
val measure : t -> measure
val split_at : (measure -> bool) -> t -> split
val of_seq : elem Seq.t -> t
val to_seq : t -> elem Seq.t
val to_list : t -> elem list
val debug_string : (elem -> string) -> (measure -> string) -> t -> string
end
module Make (M : Measurable) :
S with type elem = M.t and type measure = M.measure
You can find a sketch of the implementation here.
Sadly, the functor design appears to make it impossible (or at least very difficult) to expose polymorphic APIs on top of this data structure while hiding the complexities of functors, FCMs, etc. My next exercise was to try to expose a “vector” API on top of the existing implementation. This is a less-general data structure that supports fast access to both ends and efficient random access and split/concat, but not arbitrary search predicates over measures. (Think of this as specializing the Measurable
structure above to type measure = int
and making the element type polymorphic.) Ideally, its signature would look something like the following:
type 'a t
type 'a split = 'a t * 'a t
val empty : 'a t
val is_empty : 'a t -> bool
val prepend : 'a -> 'a t -> 'a t
val append : 'a -> 'a t -> ' t
val pop_left : 'a t -> ('a * 'a t) option
val pop_right : 'a t -> ('a * 'a t) option
val concat : 'a t -> 'a t -> 'a t
val size : _ t -> int
val split_at : int -> 'a t -> 'a split
val at : int -> 'a t -> 'a
val set : int -> 'a -> 'a t -> ' t
(* Various conversions... *)
val of_seq : 'a Seq.t -> 'a t
val to_seq : 'a t -> 'a Seq.t
val to_list : 'a t -> 'a list
I’m willing to relax this and admit additional type parameters on the top-level wrapped type (e.g., for measure witnesses, brands, etc.), but the idea is to reuse the original general finger tree functor implementation as-is and expose a friendly, flexible API for this use case. If at all possible, I also want to completely hide any first-class modules or functor shenanigans that may be happening under the hood. It should just “feel” like a regular list or array, but with fast random access with full persistence.
Is this possible with the current design or does the API need to be rewritten so it’s amenable to polymorphic access? For inspiration, I’ve looked briefly at how Base does things for polymorphic maps, but I haven’t grasped exactly how it works and it’s not clear how I could adapt it to work with the associated measure
type anyway.
For a bit more context, I’ve tried a bunch of shims on top so far, but nothing has supported the full gamut of operations I want. The closest solution I’ve found involves using a GADT type for the abstract 'a t
above and tying the internal type (Finger_tree.S.t) to an existential type that gets wrapped into a first-class module along with the data structure instance. This is unsatisfactory for two reasons:
- It forces you to introduce a
unit
parameter during construction, i.e.,val empty : 'a t
becomesval empty : unit -> 'a t
. This is a bit annoying, but not so much of a deal breaker if you can get the compiler to infer that allVector.t
instances that were built up from that same initial seed share an internal representation and can be unified. - It completely breaks all functions that take multiple vector instances and try to interoperate them. For example, no matter how I spelled the type by reparameterizing or adding details to the packed module, I was unable to get the compiler to unify the two module types given to
concat
.