Modules that extend modules from third-party packages

package-design
#1

What is the recommended way to structure modules that add extensions onto other modules that come from external packages (over which you have no control)?

How do you then structure those modules so that they can themselves be extended, and/or that the extensions can be taken out separately and, say, applied on top of other extensions or modifications to those libraries (say, if I target Base, being able to apply the extensions to Core_kernel)

Currently I’m doing something like this, where I want to add things to a module in Base:

(* base_exts/bar.mli *)

include module type of Base.Bar
module Extensions : sig
  val foo : t -> Something.t -> Something_else.t -> t Base.Or_error.t
end
include module type of Extensions

(* base_exts/bar.ml *)

include Base.Bar (* [!] *)
module Extensions = struct
  let foo (bar: t) (baz: Something.t) (barbaz: Something_else.t)
    : t Base.Or_error.t = (* do something *)
end

If I then want to re-apply the same extensions to the Core_kernel version of baz, I’ll just import Base_exts.Bar.Extensions over an include of Core_kernel.Bar. The extensions still depend on the Base implementation of everything, but that shouldn’t matter as long as the t agrees?

It occurs to me (after trying to publish an opam package with this setup, natch) that this might be a Very Bad Idea:

  • That include may very well be copying the whole body of Base.Bar into my module. (I’m not sure how includes work, but that’d be the most semantically obvious thing to do.) I definitely don’t want to be distributing half of, say, Jane Street’s libraries in my opam packages, for obvious infrastructural and legal reasons! I’ve seen parts of Core_kernel include parts of Base like this, but it may be that, since they’re two parts of the same library family, this is OK to do in that situation.
  • odoc seems to be picking up the entire API surface of Base.Bar when I do the above. This certainly isn’t what I want—I want to be loosely saying 're-export everything that Base.Bar exports`, not ‘re-export this specific thing and this specific thing and then this specific thing’, especially since the latter ties the documentation to a specific version of the external library.

So far I’ve seen two other approaches:

  1. Don’t re-export anything, just provide extensions to be used as a separate module alongside the original one. This is what I used to do, and what I think Core_extended and its various spinoff libraries do?, but I got sick of having to remember which function was in Base.Bar and which was in Base_exts.My_bar. In hindsight this was probably an acceptable compromise (and avoids having to remember which things are in the other library and which are the extensions), and I might revert back to it.
  2. Core_kernel sometimes includes Base modules in the form
    type t = Base.Foo.t [@@deriving stuff]
    include (Base.Foo : (module type of struct include Base.Foo end with type t := t))
    
    I’m not sure whether the indirection of hiding Base.Foo's type inside another module type has any purpose other than enabling the re-declaration of t, but, if so, is this relevant to what I’m looking at?

Other suggestions very welcome :slight_smile:

1 Like
#2
  1. fork the project
  2. extend the module
  3. (optionally, but necessary) submit the patch upstream (aka PR)

Not really the answer you were looking for? Then read below. Modules in OCaml are not extensible, they are closed structures, like final classes in Java, that are not extensible by design. OCaml modules are not namespaces. OCaml doesn’t have namespaces1 and modules are not substitution for the namespaces. Trying to use modules as namespaces will leave both parties unhappy, you and OCaml.

Yes, it is harsh, and namespaces is the feature I miss the most when I’m developing large programs in OCaml2. However, let’s look deeper into the program model of OCaml to understand why this is happening and is there a right way to code in OCaml and be happy.

There are two kinds of modules in OCaml, structures and functors. Your question is more about the former. OCaml is a language of mathematics, where structures denote algebras, i.e., tuples of functions attached to a set. In mathematics there is only one algebra of integers. You can’t have Janestreet’s arithmetics, Matt’s arithmetics, or Ivan’s arithmetics. If you do, then those are different algebras with different laws, and therefore they have different structures. In other words, OCaml wasn’t really designed that way, it is the essence of mathematics, our vision of mathematics that we, the humanity, have developed so far. OCaml just inherited this approach, no more no less. And this is where mathematics clashes with its offspring - programming. Yes, as software developers we need namespaces, as we need to reuse software components developed by others, we want to build systems from packages, like engineers are building complex structures from existing building blocks. Not something that mathematics is really offering us, instead it gives us the theory of categories and homotopy type theory, that are quite orthogonal to the design patterns of software engineering.
The main difference of programming as a branch of mathematics is that it has a much lower entrance barrier (you do not need to learn category theory to program) and is much more rapidly developing3. Like it or not, but programming is still mathematics and therefore we have to play by the rules of mathematics.

With all that said, you can still develop software and apply all modern software design patterns in OCaml. Just keep in mind, that a module is not a namespace, not a package, not a component. It is a mathematical structure which is fixed. It is a tuple of values. So keep those values as they are and build new values from existing, rather than trying to destructively substitute them. But before we start to explore the design space, I need to bring here two asides, so that we can develop some context for reasoning.

Aside: The OCaml program model

It would be interesting to look inside of OCaml to understand how modules and functions are actually implemented, what semantics the include statement has and so on. In OCaml the values are not referenced by names, unlike Common Lisp, which is the language that indeed offers proper namespacing. In fact, in OCaml values are not referenced at all, there is no such kind of indirection. Values are passed directly to each other. This is a true call-by-value, do-by-value, apply-by-value language. When we see an expression f x y, it is a value f which is applied to the values x and y. Not a function named f. When we say List.length it is not treated as ["List"; "length"], it is always and directly resolved to a concrete value of the camlStdlib__list_length function, which is a piece of code4. A module, e.g., List is a record (tuple) of pointers. When you do include List you create a new tuple and copy (as with memcpy) the contents of List into the new tuple. When you create an implementation of a compilation unit, in other words, when you compile an ml file, you are actually creating a tuple of values or a structure. The interesting and a very important part here, is that a compilation unit is implicitly parametrized by all modules that occur free in it. In other words, when you create a file example.ml with the following contents,

let list_length = List.length

and compile it to code, then the compiled object module itself will not contain the List.length value. Essentially, example.cmo will be like a functor, which is parametrized by a list implementation. It is only during the linking phase, when an actual implementation of the List module will be applied, and all references to the List.length will be finally resolved to values. On one side, this is just a side-effect of the separate compilation system, on the other side it gives us an opportunity to treat compilation units as software components and build our software systems on this level of granularity. But we are not yet at this phase, despite several recent improvements in the OCaml infrastructure, which include bug fixing in the dynamic linker, module aliasing, new dependency analysis, and, last but not least, Dune; compilation units are still not the building blocks. From the program model standing point, we still are operating with values, not with names.

Aside: Common Lisp, modules, and namespaces

It is also interesting to look into other languages, which provide proper namespaces, like Java, C++, Common Lisp. Let’s pick the latter as the working example. In Common Lisp we have a notion of a symbol, which denotes an object identity. When you call a function, (f x y), you are not applying a value f to x and y, like you do in OCaml, but instead you are passing a symbol f and the runtime extracts a pointer to a function from a specific slot of the symbol object. This is basically the same, as we would be passing references to functions, e.g., if let list_length = ref List.length, and then calling it like !list_length [1;2;3]. This is, in fact, the operational model of languages with namespaces, you never call a function, you call a name of a function and the name is a variable, which changes dynamically (the level of dynamism differs from language to language). There are, of course, cons and pros of this design. The main disadvantage is that it is hard to reason about the program behavior. Because now every program is not a mathematical object built from other mathematical objects, but rather an expression in the theory of names, that have multiple interpretations in the space of the cartesian product of the sets that denote each symbol. In other words, each program term has many interpretations, like what is !list_length [1;2;3]? You can never be sure.

There is also another lesson, that we can learn from languages with namespaces. The lesson is, you still need modules. For example, in Common Lisp5, despite the presence of proper namespaces, programmers are still use names like list-length, but not list:length. Why so? Because list-length denotes an operation in the theory of lists with well defined meaning. It is not just a name, but an abstraction, therefore there could be edu.cmu.ece:list-length or com.janestreet.core:list-length. Therefore, we have an implicit (designed by convention) module list with some well-known names, which define a structure of the list algebra. So the takeaway is – modules and packages are orthogonal.

Design for extension. What choices do we have?

OCaml is a very rich language, that means it has a huge search space for the design choices. It also means, that most likely it is possible to implement any design pattern that you can find in the wild. This design space is not really fully explored (especially since the last years OCaml is rapidly developing) and not all decisions are well accepted by the community. For example, we have classes, which if being adopted by the community, could solve the module extension problem. Imagine, if instead of having the List module we had instead the list class. Now, the extension would be simply an inheritance, and names will be all properly indirected, as now when you will do list#length you will actually reference a symbol which will have multiple interpretations. However, the community didn’t really adopt this design. Well, mostly because it ended up in a nightmare :slight_smile: And it is not really about classes. Classes in OCaml is just an attempt to tame the names problem. You can go rogue and actually use records of functions instead. And even make them references, e.g.,

let 'a list = {
   mutable length : 'a List.t -> int;
   mutable nth : 'a List.t -> int -> 'a;
}

And use it like list.length [1;2;3]. The extension is a little bit hard, as records do not have row-types or an include statement (unlike objects and structures), but enables overriding. This approach is also not extremely popular, but was adopted at least in ppx rewriters.

So, this is all to say, that in OCaml it is possible to adopt any poor choice that was made in the software developing community. Fortunately, they are not very popular (that of course doesn’t prove that they are wrong). So, what is the OCaml way of designing reusable components? Ideally, components that follow the Open-Closed principle. The solution is to design for extension.

Not everything should be designed this way. This would one of those poor choices. Some entities are inherently and fundamentally not extensible. They are algebras. In the ideal world full of mathematicians, that are not bothered with time and its side effects, we should define algebras as their least fixed points (aka initial algebras). For example, the initial algebra of list (i.e., the minimum set of definitions) is its type definition, so the module List shall have only one entry (which denotes two constructors).

module List = struct 
   type 'a t = [] | :: of 'a * 'a t
end

Everything else should be put aside of the List module, because it is secondary, e.g., we can have a component called stdlib_list_basic_operations.ml which you could link into your final solution and use it, which will basically have the following interface

val list_length : 'a List.t -> int
val list_nth : 'a List.t -> int -> 'a
val list_hd : 'a List.t -> 'a

With this approach, it would be easy to compose different components, as there wouldn’t be any more competition for the List module, but instead the list interface will be composed by convention. Anyone could provide a list_something function and it is your choice as the system developer to select the right components and glue them tightly and correctly. This is, basically, the approach that is used in Common Lisp, C++, Java, and other languages.

Unfortunately, this is not the convention in OCaml. While the initial design of the OCaml system distribution exhibits some notions of this approach (cf., string_of_int, int_of_string, and the Pervasives module itself), at some point of time, this venue was abandoned, and OCaml developers sticked to the “blessed module” approach. In this approach, operations are blessed by being included in the main module and all other operations are sort of the second sort citizens. As a result, we have modules with exploded interfaces, which are hard to maintain, use, and it takes so much time to compile programs that use Janestreet’s libraries6.

Summary

Design for extensibility, when the extensibility is expected. Use small modules, which define abstractions. Protect those abstractions. If a function doesn’t require the access to the internal representation, doesn’t rely on the internal invariants of the representation, and could be efficiently implemented using the abstract interface only, then do not put it into the module. Good example, list_length - not a part of the module. But Map.length is, since it needs to access the internal representation of the binary tree. Nor it could be efficiently implemented using Map.fold. Implement all other functions in separate modules, probably structured by their purposes and domains. Use the open statement introduce those names into your namespace, but enable warning 44 and watch for it. Use open! sparingly when overriding is your intention.

When you design a software component that should be extensible, parametrize it with abstractions. Use functors, function parameters, whatever – OCaml gives you a lot of options here. You can even use references to functions, which work especially good with dynamic linking.

When you find someone else’s code which is broken, either because of a missing function in the interface or a wrong implementation do not hesitate to fork, patch, and submit. In fact, Dune facilitates this approach, so that you can create your own workspaces with core, base, and whatever libraries, edit them to your taste and get a working solution. If you want it to be reusable – then push the changes back. And we are back where we started. Yes, you can do the include trick and reexport your own List module, but this is essentially the same as cloning, patching… but not submitting back. Because at the end of the day we will now have Base.List, Core.List, Matt.List, how does it differ from having multiple forks on github or, even worse, vendoring those modules? Essentially, it is the same. So,

  1. fork the project
  2. extend the module
  3. push it back

but before doing this, ask yourself, is the operation that I’m trying to add is really a member of this module?


1) And will never have, because OCaml program model operates on values, not on names, as Common Lisp for example.

2) However, when I develop large programs in other languages I miss OCaml modules much more, than I miss namespaces in OCaml :slight_smile:

3) Programming is like mathematics without elitist approach, which is really good.
4) Don’t worry it is still represented as a pointer, but essentially it is the code.
5) The same is true for Java (classes vs packages) and for C++ (classes vs namespaces vs functions)

6) I’m not blaming anyone, I myself was for a long time an apologet of this approach, and even authored quite a few libraries, like Graphlib, Regular, Bap, etc which features those bloated interfaces. Today, I’m really happy with the direction which Base took. So now, I’m trying the new diet.

11 Likes
#3

This is a really comprehensive and thoughtful answer, and somewhat confirms my suspicions about what I’m doing being a fairly bad idea—thanks!

ask yourself, is the operation that I’m trying to add is really a member of this module?

Generally: no. What I’m doing is trying to insert operations over Lists, etc., that don’t directly depend on the intrinsics of how the lists are defined, but instead back-form implementations of patterns that are specific to the library I’m designing. Effectively, I’m trying to do what you’d do with extension methods in C#, or by defining instances of a new typeclass I’ve made over Base.List. (Indeed, I’m coming to OCaml through C#-via-F# and Haskell, so these are the mental models I’m already hardwired to try implement.)

So, if I consider what I’m doing as ‘here is a List module and I’m just yanking all of Base.List into it while exposing the fact that I’ve done so’, then… of course it makes no sense!

It also makes no sense for me to try to send what I’m doing upstream, because, while I might think that what I’m doing to List is generally useful, it doesn’t make any sense outside the environment that my library is doing, nor is it a key part of List's initial algebra (it might well define algebraic properties on List, but they’re derived ones), and putting it in Base or Core_kernel would be bloat.

I got confused by the fact that Core_kernel really does just sit on top of Base in the way that I was trying to do— but this is a special case that doesn’t generalise to what I want to do at all.

2 Likes