How can you "un-functorize" an API to make it polymorphic while retaining type safety and efficiency?

As a learning exercise while studying 2-3 finger trees, I’ve implemented the data structure in a handful of different languages. It turns out that the OCaml version is (thus far) the most readable while still retaining a high degree of type safety and efficiency while staying within the standard idioms. (Moreover, I didn’t even have to use any non-standard extensions–I ended up using type families in Haskell in a misguided attempt to avoid multi-parameter type classes.)

My main requirements in the design are to:

  • Minimize repetition. I don’t want to have to copy-paste large swaths of code to adapt to special cases, handle the leaf versus node type specially, etc.
  • Minimize memory bloat by avoiding things like storing a measurement function, etc., in every (sub)tree.
  • Maintain as much type safety as possible to prevent misuse and bugs during implementation. For example, it should be impossible to attempt to concatenate two finger trees with different internal measure functions, even if they use the same measurement type representations.

I went through several designs and ended on a functor-based implementation that essentially uses an associated type for the monoidal measure (along with a measure function on values) to represent the structural requirements on elements. Internally, the nodes are represented using Peano-style GADTs (where the original Haskell paper uses first-class polymorphic types). As I see it, these decisions meet the above requirements in a very satisfying way. For example, the use of a coupled measure type ensures that measurements can’t be mixed and matched unintentionally, while the functor application allows static dispatch of the measurement function itself (at least in theory–I’m not sure how OCaml compiles this under the hood). Similarly, the GADT representation could in principle compile inner node types to untagged variants under the hood with an optimizing compiler (again, not sure how realistic this is today).

Here’s the module interface:

module type Measurable = sig
  type t
  type measure

  val zero : measure
  val ( + ) : measure -> measure -> measure
  val measure : t -> measure
end

module type S = sig
  type elem
  type measure
  type t
  type split = t * t

  val empty : t
  val is_empty : t -> bool
  val prepend : elem -> t -> t
  val append : elem -> t -> t
  val pop_left : t -> (elem * t) option
  val pop_right : t -> (elem * t) option
  val concat : t -> t -> t
  val measure : t -> measure
  val split_at : (measure -> bool) -> t -> split
  val of_seq : elem Seq.t -> t
  val to_seq : t -> elem Seq.t
  val to_list : t -> elem list
  val debug_string : (elem -> string) -> (measure -> string) -> t -> string
end

module Make (M : Measurable) :
  S with type elem = M.t and type measure = M.measure

You can find a sketch of the implementation here.

Sadly, the functor design appears to make it impossible (or at least very difficult) to expose polymorphic APIs on top of this data structure while hiding the complexities of functors, FCMs, etc. My next exercise was to try to expose a “vector” API on top of the existing implementation. This is a less-general data structure that supports fast access to both ends and efficient random access and split/concat, but not arbitrary search predicates over measures. (Think of this as specializing the Measurable structure above to type measure = int and making the element type polymorphic.) Ideally, its signature would look something like the following:

type 'a t
type 'a split = 'a t * 'a t

val empty : 'a t
val is_empty : 'a t -> bool
val prepend : 'a -> 'a t -> 'a t
val append : 'a -> 'a t -> ' t
val pop_left : 'a t -> ('a * 'a t) option
val pop_right : 'a t -> ('a * 'a t) option
val concat : 'a t -> 'a t -> 'a t
val size : _ t -> int
val split_at : int -> 'a t -> 'a split
val at : int -> 'a t -> 'a
val set : int -> 'a  -> 'a t -> ' t
(* Various conversions... *)
val of_seq : 'a Seq.t -> 'a t
val to_seq : 'a t -> 'a Seq.t
val to_list : 'a t -> 'a list

I’m willing to relax this and admit additional type parameters on the top-level wrapped type (e.g., for measure witnesses, brands, etc.), but the idea is to reuse the original general finger tree functor implementation as-is and expose a friendly, flexible API for this use case. If at all possible, I also want to completely hide any first-class modules or functor shenanigans that may be happening under the hood. It should just “feel” like a regular list or array, but with fast random access with full persistence.

Is this possible with the current design or does the API need to be rewritten so it’s amenable to polymorphic access? For inspiration, I’ve looked briefly at how Base does things for polymorphic maps, but I haven’t grasped exactly how it works and it’s not clear how I could adapt it to work with the associated measure type anyway.

For a bit more context, I’ve tried a bunch of shims on top so far, but nothing has supported the full gamut of operations I want. The closest solution I’ve found involves using a GADT type for the abstract 'a t above and tying the internal type (Finger_tree.S.t) to an existential type that gets wrapped into a first-class module along with the data structure instance. This is unsatisfactory for two reasons:

  • It forces you to introduce a unit parameter during construction, i.e., val empty : 'a t becomes val empty : unit -> 'a t. This is a bit annoying, but not so much of a deal breaker if you can get the compiler to infer that all Vector.t instances that were built up from that same initial seed share an internal representation and can be unified.
  • It completely breaks all functions that take multiple vector instances and try to interoperate them. For example, no matter how I spelled the type by reparameterizing or adding details to the packed module, I was unable to get the compiler to unify the two module types given to concat.

To clarify, I’m fine with rewriting the original module/interface substantially as long as the new design satisfies the main requirements stated above and does not require maintaining two separate implementations.

I think the base solution fits quite well: the idea is that if you have a functor:

type 'a spine
module F(X:S) = struct
   type t = X.t spine
end

where the functor argument provides a collection of functions over a family of non-parametric types, you can internalize this collection of function inside the main type as a first class module (or same difference an existentially quantified record).

However, I think that the problem that you mention with the associated measure type corresponds to a genuine gap in the specification of the API. Does it really make sense to
concat two trees with a different measure, if the answer is no the

val concat: 'a t -> 'a t -> 'a t

does not really make sense and what you wanted is more

val concat: ('a, 'measure_id) t -> ('a, 'measure_id) t -> ('a, 'measure_id) t

which can be achieved with a generative functor creating fresh types for set of function.

In a similar way, the creation function empty cannot be independent of the measure, and its type would be necessary something like

val empty: 'elt measure -> 'elt t

if trees with different measure are compatible, or

val empty: ('elt,'id) measure -> ('elt,'id) t

otherwise.

Thanks for the suggestion! Could you expand more on what shape the functor argument should have? Is this not the entire collection type itself, but an inner building block instead?

In the general case no. I added only a single type parameter to make this suggestive. I’m only trying to parameterize the vector interface where the measure is fixed to int ahead of time (and moreover the measure monoid and weight maps are also fixed).

If I only care about exposing 'a in the polymorphic API, do I still need to parameterize over the measure type itself in that inner functor? (The answer to that question might be obvious if I understood your top-level suggestion, but I’m still unclear on exactly what the “seed” implementation looks like in this case.)

Are you saying that S here captures the structure on elements (element type, measure, etc)? I’m still not sure what to do with the outer type 'a t and how to connect it to the inner spine in this case. Is it a problem if I only want one of those types to become polymorphic (the element)? For the general case, I’m happy with the functor API and want to be able to use it approximately as is.

I’ve been working on the same project. I use destructive type replacement to get it done.

My signature looks like this

module type Measurable = sig
  type +'a elt
  type monoid
  val null : monoid
  val measure : 'a elt -> monoid
  val add : monoid -> monoid -> monoid
end

module type S = sig
  type monoid
  type +'a elt
  type +'a t
  val empty : 'a t
  val measure : 'a t -> monoid
  val fold_right : f:('a elt -> 'b -> 'b) -> 'a t -> init:'b -> 'b
  val fold_left : f:('b -> 'a elt -> 'b) -> init:'b -> 'a t -> 'b
  val map : f:('a elt -> 'b elt) -> 'a t -> 'b t
  val filter : f:('a elt -> bool) -> 'a t -> 'a t
...

And then the functor looks like this:

module Make (M: Measurable)
  : (S with type monoid := M.monoid
        and type 'a elt := 'a M.elt)
= struct
  type +'a elt = 'a M.elt
  type 'a measure = 'a -> M.monoid
...

then when defining the module for Sequence, it looks like this:

include FingerTree.Make(struct
    type 'a elt = 'a
    type monoid = int
    let null = 0
    let measure _ = 1
    let add = Int.add
  end)

And the resulting signature looks like this:

module FtSeq :
  sig
    type +'a t
    val empty : 'a t
    val measure : 'a t -> int
    val fold_right : f:('a -> 'b -> 'b) -> 'a t -> init:'b -> 'b
    val fold_left : f:('b -> 'a -> 'b) -> init:'b -> 'a t -> 'b
    val map : f:('a -> 'b) -> 'a t -> 'b t
    val filter : f:('a -> bool) -> 'a t -> 'a t

So in this case all the appearances of 'a elt are replaced with 'a

This approach is working well for me.

2 Likes

This interface force you to make the measure independent of the leaf elements, which avoids a lot of the design complexity.

@bsidhom , it might be simpler to illustrate on your code. The idea is to try to move the functor arguments at the function or type definition boundary first. For the type definition, this means
transforming the type arguments into type parameters, thus

  type 'n element =
    | Leaf : M.elem -> zero element
    | Branch : 'n node -> 'n succ element

  and 'n node =
    | Node2 : M.measure * 'n element * 'n element -> 'n node
    | Node3 : M.measure * 'n element * 'n element * 'n element -> 'n node
   ...

becomes

  type ('n,'elem,'measure) element =
    | Leaf : 'elem -> zero element
    | Branch : 'n node -> 'n succ element

  and ('n, 'elem, 'measure) node =
    | Node2 : 'measure * 'n element * 'n element -> 'n node
    | Node3 : 'measure * 'n element * 'n element * 'n element -> 'n node
...

For functions, we can group the functions coming from the functor argument into a record (we could call it the class type dictionary).
For instance, this correspond to transforming

  let rec append' : type n e m. (n,e,m) element -> (n,e,m) tree -> (n,e,m) tree =
   fun elem t ->
    match t with
    | Empty -> Single elem
    | Single a ->
        let m = M.(measure_element a + measure_element elem) in
        Deep (m, Digit1 a, Empty, Digit1 elem)
    | Deep (m, left, deeper, right) ->
        let m = M.(m + measure_element elem) in
        ...

into

 type ('e,'m) measure = {
   zero: 'm;
   add: 'm -> 'm -> 'm;
   measure_element: 'a -> 'm
  }
 let rec append' : type n e m. (e,m) measure -> (n,e,m) element -> (n,e,m) tree -> (n,e,m) tree =
   fun md elem t ->
    match t with
    | Empty -> Single elem
    | Single a ->
        let m = md.measure_element a + md.measure_element elem in
        Deep (m, Digit1 a, Empty, Digit1 elem)
    | Deep (m, left, deeper, right) ->
        let m = m + md.measure_element elem) in
        ...

But then, it is bothersome to add this dictionary to every function call.
A natural solution is to pair the measure dictionary with the tree

type ('n,'e,'m) t = { tree: ('n,'e,'m) t; measure: ('e,'m) measure }

but this leads to the problem that functions with two trees argument may end up with two different measure dictionaries. Since this should not be allowed by the specification, this should be a type error. In other words, we need to create abstract types from values. This is can done with a functor:

module Measure: sig
   type ('id, 'm,'e) t
   val add: ('id,'m,'e) t -> 'm -> 'm -> 'm
   val zero: ('id,'m,'e) t -> 'm
   val measure: ('id,'m,'e) t -> 'm
   module Make(M:Measurable): sig
        type id
        value m: (id,M.m,M.e) t
    end
end = struct
   type ('id,'m,'e) t = {
   zero: 'm;
   add: 'm -> 'm -> 'm;
   measure_element: 'e -> 'm
  }
  let add m x y = m.add x y
  let zero m = m.zero
  let measure x = m.measure_element x
  module Make(M:Measurable) = struct
     type id
     let m = M.{add = (+); zero; measure_element }
  end
end 

From a higher-point of view, we are adding this functor as an administrative layer on the top of the record that ensures that there is a single inhabitant for a given measure type ('id,'m",'e) t.

We can then come back the paired definition and add one last type parameter to enforce our requirement that we have only one measure value by type ('e,'m,'id) measure:

type ('n,'e,'m,'id) t = { tree: ('n,'e,'m) t; measure: ('e,'m,'id) measure }

or removing the depth parameter

type ('elt, 'measure, 'id) t =  {
   tree: ('n,'e,'m) t;
   measure: ('e,'m,'id) measure
}

The simple functor can be then implemented by applying the measure functor and fixing the measure and element types.

The vector interface might require a polymorphic extension to the Measure functor:

module Measure: sig
   ...
   module Poly(M:sig type m val (+): m -> m -> m val zero:m val one :m end): sig
        type id
        value m: (id,M.m,'any) t
    end
end

to allow creating the Vector measure instance as

module Int_m = struct type m = int let (+) = (+) let zero = 0 let one = 1 end
module Vec_measure = Measure.Poly(Int_m)
type 'a vector = ('a,int,Vec_measure.id) t
let empty () = empty Vec_measure.m
3 Likes

Thanks! I’m not sure why I didn’t think to make the element type itself polymorphic. I ended up with a similar interface to yours that works basically exactly as desired.

module type Measurable = sig
  type +'a t
  type measure

  val zero : measure
  val ( + ) : measure -> measure -> measure
  val measure : 'a t -> measure
end

module type S = sig
  type +'a elem
  type measure
  type +'a t
  type +'a split = Split of 'a t * 'a t

  val empty : 'a t
  val is_empty : 'a t -> bool
  val prepend : 'a elem -> 'a t -> 'a t
  val append : 'a elem -> 'a t -> 'a t
  val pop_left : 'a t -> ('a elem * 'a t) option
  val pop_right : 'a t -> ('a elem * 'a t) option
  val concat : 'a t -> 'a t -> 'a t
  val split_at : (measure -> bool) -> 'a t -> 'a split
  val of_seq : 'a elem Seq.t -> 'a t
  val to_seq : 'a t -> 'a elem Seq.t
  val to_list : 'a t -> 'a elem list

  val debug_string :
    ('a elem -> string) -> (measure -> string) -> 'a t -> string
end

module Make (M : Measurable) :
  S with type 'a elem = 'a M.t and type measure = M.measure

And the vector wraps it more or less as expected (I’ve updated the gist).

1 Like

Ah, that’s even more general than the solution that I was looking for. Indeed, the problem of mismatched measure functions is exactly the problem I was concerned with in that approach, which is part of the reason I didn’t try to make the measure itself polymorphic (though the approach is interesting). I’m still digesting exactly how you collapse the measure soundly, but I suspect this is close to the approach I would have considered previously.

One thing that still annoys me about the approach where you parameterize over measures is that you still have to retain the auxiliary functions in every retained tree. On it’s face it doesn’t seem too bad because you’re only attaching the “class type dictionary” to top-level trees, but that does result in O(n) extra memory in the context of full persistence with a history of length n (i.e., you’re saving a factor of O(log(n)) from the naive approach of stuffing that into recursive trees). Do you know if there’s a way around this? Keeping the functor approach but specializing the measure to int (as in the current desired use case) handles this nicely, but of course sacrifices flexibility in the measure itself.

OK, after having digested it, I do really like this technique of making it polymorphic over the measure type, especially because it allows the measure to be decoupled from the element type while ensuring uniqueness at the type level. Branding the measures with the id type by making it abstract and then stuffing it into the parameter does this much more cleanly than how I was trying (and failing) to do it with existentials. Is there a name for this general pattern? I would love to find more examples in the wild to get a better sense for the trade-offs that come with it.

I am not sure if this pattern of creating unique type from values through functor have a name?

With the uniqueness guarantee, you can split back the trees and the measure record if you can store the measure record in some context. For instance, if I have arrays with a lot of small trees, I can define an array of trees as:

(* Let"s be explicit with the type names *)
type ('elt, 'measure, 'id) paired_tree =  {
   tree: ('n,'e,'m) real_tree;
   measure: ('e,'m,'id) measure
}

module Array_of_tree: sig
  type ('a,'id, 'm) array
  val (.!()): ('a,'id, 'm) array -> int -> ('a,'id,'t) paired_tree
  val (.!() <- ): ('a,'id, 'm) array -> int -> ('a,'id,'t) paired_tree -> unit
end = struct
  type ('a,'id,'m) array = ('a,'id,'m) measure * ('a,'id,'m) real_tree array
  let (.!()) (m,a) n = { measure = m; tree = a.(n) }
  let (.!()) (_,a) n x = a.(n) <- x.tree 
end

then my array stores the measure only once, while preserving the easy to use interface outside the array. This is also means that one could implement a version of the monomorphic functor that uses the non-paired representation internally and exposes conversion functions to the paired version

module F(M:Measurable) ->
   type id
   type t
   ...
   val export: t -> (M.elt, M.measure, id) paired_tree
   val import: (M.elt, M.measure, id) paired_tree -> t
end = ...

For persistent datastructure, maybe it would be possible to use the unpaired trees in the history, and a notion of active node which reattaches the measure to the tree.

1 Like