JaneStreet's OCaml Core library's Hashtbl interfaces

Available Hash Table APIs

The question is about which approach to constructing a hash table in Core is preferable.

I noticed that the JaneStreet’s Core Hashtbl library provides several interfaces to creating hash tables. Let’s assume that there’s no interest in polymorphic hash tables. Is there one approach that is most preferred? What is the role of Make functor there? Is the use of this functor a preferred approach? It seems that the use of Key.Table approach is referenced in earlier documentation, and I didn’t notice the presence of Make functor in the earlier versions of the Hashtbl. This suggests that maybe Make functor based approach is a more recent method.

I know that in the base library the Make-based approach has existed more or less from the beginning.

Given that, I don’t understand what the Hashtbl.create is doing there, whence we already have other approaches taking equivalent record of functions. Is this a legacy method?

A couple of similar inquiries appeared on the forum here and here, but none had considered the issue of choosing an interface in the Core.

There’s a type (_ , _, _) Hashtbl_intf.create_options_without_hashable that emerges when a function create in the module generated by the Make is invoked. Now, this looks like a source of concern to me in the sense that hashable-based approach (and thus the Table functor? Just like in the old Base interface?) is suggested to be more normative rather.


Constructing Families of Hash Tables

If I would like to use hash tables that reference other hash tables (e.g. mapping from ids to property lists), is there a difference which interface I use as far as efficiency is concerned?

2 Likes

Hello! That’s a great question thanks for asking. +1 I’m interested to know what someone from JS is going to respond.

Personally, I very much like the take a first class module approach of the containers interface from Base. It seems to go well with a general direction that the language could take in the future (consider for example the work on modular-explicits).

A point of friction I have with functors is that I never really know where to put the invocation in my code, and often end up either with unusual modules that I need to import in some ad-hoc way, or worse, end up with multiple invocations in different places of the functor with the same arguments. Using the first class module design solves this, and is my preferred way at the moment. I wish more currently-functorized libraries would add a support for an interface of that style, when possible!

A word of caution though, I don’t know whether having this cosmetic preference comes with attached downsides. Given that it is not as widely popular as the functor approach, there may be reasons behind it. I’d like to know more about this if someone has given it some thoughts.

One point that may or may not be relevant in your case is that the compiler has an easier time optimizing the functor-based approach, at least according to the last benchmarks I know that focused on this, see here.

3 Likes

I think, the functors could be applied to first-class modules. Given that functors also are dependent types, we actually have a tool that could yield a richer semantics.

What is bizarre to me is the distinction between Hastbl.Make and Key.Table based approaches. My guess is that one of them is a legacy one.

The use of functors conceptually seems to make more sense, especially since apparently the functors could be applied to first-class modules, and thus there’s no restriction when such functionality is required (e.g. in dynamic loading it’s unavoidable).

Thank you for sharing the benchmark results. As a digression, I would like to note that the consistent results of Core, and the slightly lower speed at the higher optimization level, is perhaps explainable by the fact that the Core’s (and I assume Base’s) implementation of Hashtbl uses an array of AVL trees rather than array. In production, that seems to be a much more preferable solution than just an array, regardless of a slight performance drop.

As of v0.17:

  • I can’t find a functor Base.Hashtbl.Make. Looks like Base only exports the first-class module pattern nowadays.
  • If you are comparing solely Core.Hashable.Make (the Key.Table pattern) against Core.Hashtbl.Make my conclusion looking at the code is that Core.Hashable.Make simply packages together several instantiations of hash related functors (Hashtbl, Hash_set and Hash_queue), and gives them standard names (Table, Hash_set and Hash_queue). Core.Hashable.Make actually calls Core.Hashtbl.Make under the hood to build the Table part (if you only need the table one, I suppose there’s no need to build the other 2).
module Make (T : sig
    type t [@@deriving hash]

    include Hashtbl.Key with type t := t
  end) : S with type t := T.t = struct
  include T
  module Table = Hashtbl.Make (T)
  module Hash_set = Hash_set.Make (T)
  module Hash_queue = Hash_queue.Make (T)

  let hashable = Table.hashable
end
1 Like

Right. It’s only available in Core, and used to be available in now-defunct Core_kernel. This fact is one of the reasons why I’ve mentioned deprecation in my original posts.

So, if all I need is just a hash table (to map ids to records or property lists, etc.), then by using Hashtbl.Make, I’ll be getting the same code and same API working as with the Key.Table, which in turn is an invocation of Hashable.Make?

A question arises, which approach would be more idiomatic? For you, but also perhaps for JaneStreet? For the latter case, is there a justification published? Some blog notes? (my Google-fu — which was utilized prior to publishing the original post — returned nothing on the latter :face_with_diagonal_mouth:).

I think so, yes. That’s my understanding of the code, reading a recent checkout of core (>= v0.17, maybe true for older versions but I haven’t checked.)

Thank you. I wonder, why are there two different interfaces in the API in the first place? I mean, it’s relatively trivial to invoke the constructors, right?

The original API is the Hashtbl.Make one from Core, the newer one is the defunctorized one from both Base and Core.

Both are used a lot and are essentially straightforwardly interchangeable. The difference between Foo.Table.create () and Hashtbl.create (module Foo) is whether you pass the information about the key being Foo.t is passed via the function name or a function argument.

The defunctorized API in Base was created for a few reasons:

  • to reduce dependencies between modules. Core.Int contains Core.Int.Table so it must depend on Core.Hashtbl (or Base.Hashtbl perhaps). So if your code uses just one function of Core.Int, you end up linking Core.Hashtbl (and Hash_set, and Map, and Set, and Hash_queue, and whatever they bring along transitively). In Base, Int and Map are independent, and you still don’t need to apply a functor to use an map keyed by ints (unlike with the stdlib).
  • to work better with [@@deriving]. With a list, if I have type t = (A.t * B.t) list [@@deriving of_sexp], that types so long as A.t and B.t have [@@deriving of_sexp], which is ideal (and similarly for any other derivers that are defined recursively on the type). With the Base interface, the same thing is true with type t = B.t Hashtbl.M(A).t [@@deriving of_sexp]. With Core.Hashtbl.Make, it’s not enough for A.t to support of_sexp, you need to add code around the Hashtbl.Make call to say so, which is boilerplate that’s redundant with the definition of A.
  • not applying functors for maps, sets, hashtbls, hash_sets etc is just less boilerplate to write (that boilerplate is reduced with the include functor proposal, but still), less noise in api docs, less space taken in .cmi, .cmt.
  • removing an annoyance with the Core interface, not with Hashtbl, but with Map (and Set). The convention to define Foo.Map as the maps keyed by Foo means that it’s awkward in foo.ml to use maps keyed by something other than Foo (Core.Map is shadowed by Foo.Map). With Base, there’s no Foo.Map, so no problem.

Personally, I only use the defunctorized interfaces mostly for the boilerplate reasons.

3 Likes

Thank you for sharing this. This is very insightful.

May we conclude that the first-class module interface is now considered idiomatic? The JaneStreet’s blogs and other writings accessible through Google do not seem to suggest one way or another. Nor does the API documentation…

May we assume that there’s no cost in efficiency for use of first-class modules? The question is relevant for me also in considering what data representation to use in other parts of my program — that is whether the first-class modules are indeed viable for such purposes.

I think, that in the module generated by functor, there’s still going to be a [@@deriving] annotation already embedded. Is it not transparent with functors? But transparent with first-class?

May we conclude that the first-class module interface is now considered idiomatic? The JaneStreet’s blogs and other writings accessible through Google do not seem to suggest one way or another. Nor does the API documentation…

Both APIs are idiomatic. If you want to know the most idiomatic one, I don’t think that’s settled. The functor API is almost certainly used vastly more.

May we assume that there’s no cost in efficiency for use of first-class modules? The question is relevant for me also in considering what data representation to use in other parts of my program — that is whether the first-class modules are indeed viable for such purposes.

I don’t expect any perf difference between the two APIs. The functors only gives functions to create new hash tables, so your calls to Hashtbl.set and Hashtbl.find will be exactly the same either way. Stdlib.Hashtbl.Make works differently, which I think the other thread was saying is more easily optimized by flambda.

I think, that in the module generated by functor, there’s still going to be a [@@deriving] annotation already embedded. Is it not transparent with functors? But transparent with first-class?

Just take a silly “identity” functor, since there’s nothing Hashtbl-specific here:

module M(X : sig type t end) = struct
  type t = X.t
end
module M_int = M(Int)
type t = M_int.t [@@deriving of_sexp]

That won’t type since M_int.t doesn’t provide [@@deriving of_sexp] even though Int does. You can fix it:

module M(X : sig type t [@@deriving of_sexp] end) = struct
  type t = X.t [@@deriving of_sexp]
end
module M_int = M(Int)
type t = M_int.t [@@deriving of_sexp]

That solves the problem, but you’ve broken all the users of the functor that didn’t care about of_sexp and don’t want to provide it for their key. So an actual solution that supports all cases is:

module M(X : sig type t end) = struct
  type t = X.t
  module Provide_of_sexp(Y : sig val t_of_sexp : sexp -> X.t end) = struct
    let t_of_sexp = Y.t_of_sexp
  end
end
module M_int = struct
  include M(Int)
  include Provide_of_sexp(Int)
end
type t = M_int.t [@@deriving of_sexp]

That way, different callers can ask for the optional functionalities that they have the requirements for, like having of_sexp on the input type to give of_sexp on the output type. But when you apply the functor, you have to state that Int supports of_sexp.

With first class modules, what happens instead is (taking a syntactic shortcut):

module M(X : sig type t end) = struct type t = X.t end
let m__t_of_sexp (type a) (module X : sig type t = a [@@deriving of_sexp] end) sexp =
  X.t_of_sexp sexp

type t = M(Int).t [@@deriving of_sexp]
(* deriving  ends up generating:
   let t_of_sexp sexp = m__t_of_sexp (module Int) sexp *)

The code generated by the final [@@deriving of_sexp] directly requests of_sexp from Int. And users of M who don’t request of_sexp don’t have to provide it for the functor argument.

3 Likes

Thank you. I think both of your last two responses are a resolution to my original inquiry. Is there a suitable way to mark them? (I think, the “Solution” tag is limited to one per post…).

As a side-note, I suppose that the -O3 or flambda (my compiler is always built with flambda) would inline the code properly, even for first-class modules case.

Thank you. I think both of your last two responses are a resolution to my original inquiry. Is there a suitable way to mark them? (I think, the “Solution” tag is limited to one per post…).

No idea!

As a side-note, I suppose that the -O3 or flambda (my compiler is always built with flambda) would inline the code properly, even for first-class modules case.

Inlining would work, but I don’t expect inlining alone to result in a speedup. What presumably helps is having a copy of Hashtbl.find and co with the specific hashing and comparison functions baked in, at least when they keys are cheap to manipulate, like integers. That specialization is easy to apply to Stdlib.Hashtbl.Make, not so much with Base.Hashtbl and Core.Hashtbl.

I tagged the last post. Unfortunately, even a forum presents us with some design limitations to wrestle with… :roll_eyes: :no_mouth:

Thank you for sharing this observation. I’ll use the more conservative functor-based approach for the time being in my code. It become largely idiomatic in practice at this point. I wonder if flambda-2 (cf. here) would tackle this problem better. Maybe, even the issues with efficiency of objects, when the layout is known statically…