Getting rid of variant constructors

Disclaimer: This is a copy of a StackOverflow question I posted yesterday. I’m posting it here too for extended discussion.

As a side project, I try to implement the basics of an RDF library in OCaml.

As you may (or may not) know, a RDF statement (or triple) is composed of 3 parts:

  • The subject can be an IRI or a blank node;
  • The predicate must be an IRI;
  • The object can be an IRI, a blank node or a literal.

I have module and types for IRIs, blank nodes and literals, and in order to type-proof the rules described above, here is what I’ve started to write:

(* In `` *)
type subject = Iri of Iri.t | Bnode of Bnode.t
type objekt = Iri of Iri.t | Bnode of Bnode.t | Literal of Literal.t

type t = subject * Iri.t * objekt

let create s p o = s, p, o

So this is nice and everything, but one thing grinds my gears: whenever I want to use Triple.create, I must explicitly state the constructor of the variant:

let iri = (* Some Iri.t value *) in
let literal = (* Literal.t value *) in
Triple.create (Iri iri) (Iri iri) (Literal literal)

I’m pretty sure OCaml has ways to work around that, but I’m not sure how.

Some thoughts: I could parameterize the Triple.t type with the type of its subject and the type of its object, but then how do I enforce the restrictions on the parameter types? Or maybe it is a good use case for a GADT?

Someone in the comments also suggested using OOP for this particular case, but I don’t know how I feel about this. These types don’t have much in common other than the fact that they can be used interchangeably in a RDF triple.

You can’t do such a thing directly without the constructors, no. With subtyping you could use ``reuse’’ the constructors, but that’s all you could achieve, and you would still need a value constructors for proper subtypes (i.e. object constructors or polymorphic variants).

For example

val create : [`Iri of Iri.t | `Bnode of Bnode.t] -> [`Iri of Iri.t] ->  [`Iri of Iri.t | `Bnode of Bnode.t | `Literal of Literal.t] -> ...

Hmm, I see.

As a library user, would you mind having to specify a constructor in cases like this? Can you think of similar examples in the wild?

Why do you want to get rid of constructors? There are plenty of options here, including GADT, phantom types, etc. Can you explain what you’re trying to achieve, like ideally?

I feel like constructors add an unneeded overhead to the final syntax. Maybe this is because of my Ruby background, where typing is so loose.

Ideally, something along these lines would be awesome:

let bnone = (* Some Bnode.t value *) in
let iri = (* Some Iri.t value *) in
let literal = (* Some Literal.t value *) in

let some_triple = Triple.create iri iri literal
let other_triple = Triple.create bnode iri literal
let another_triple = Triple.create iri iri bnode

(* These invalid examples would get caught by type-checking *)

let invalid_triple = Triple.create literal iri iri (* Literal.t cannot be a subject *)
let another_one = Triple.create iri bnode iri (* Bnode.t cannot be a predicate *)
let yet_another_one = Triple.create bnode literal literal (* Literal.t cannot be a predicate *)

So again, we have many options here, depending on how you will design or redesign your RDF representation. If we will start with your particular representation, we can see that you have an rdf construction function with a bounded polymorphism, i.e., there are certain combinations of types that are accepted, and certain that are not. We can easily express this using a GADT:

type (_,_) valid =
  | II : (Iri.t , Iri.t) valid
  | NN : (Bnode.t , Bnode.t) valid
  | IN : (Iri.t , Bnode.t) valid
  | NI : (Bnode.t , Iri.t) valid
  | IL : (Iri.t , Literal.t) valid
  | NL : (Bnode.t , Literal.t) valid

We basically enumerated all possible combinations of the input parameters, which are acceptable. Since we currently don’t have a notion to speak about RDF elements in general (i.e., we don’t have an abstraction for an rdf element) we have to range our constructor over the set of all type constructors (i.e., there is nothing that tells us that Iri.t has any relation with Bnode.t for example).

Now, we can use a GADT to represent RDF which will be constrained by the type that is a member of our set of acceptable pairs of types :

type ('a,'b) rdf = Rdf : ('a,'b) valid * 'a * 'b -> ('a,'b) rdf

So now the create function is very simple:

let create c x y = Rdf (c,x,y)

Without any wrapping to the arguments and with the type ('a, 'b) r -> 'a -> 'b -> ('a, 'b) rdf, that basically says, that if there exists a and b s.t. (a,b) valid then there exists('a,'b) rdf. And we also provide 6 axioms, that state that the following set of pairs (out of the infinte set of all possible pairs of all tyes) form a valid representations of RDF:

  (Iari.t , Iri.t),
  (Bnode.t , Bnode.t),
  (Iri.t , Bnode.t),
  (Bnode.t , Iri.t),
  (Iri.t , Literal.t),
  (Bnode.t , Literal.t)

This solution basically reifies ad-hoc polymorphism into a GADT constraint. So now, we have to implement overloading for each possible variant of our multimethod, e.g.,

let show : type s o. (s,o) rdf -> string = function
  | Rdf (II,x,y) -> x ^ y
  | Rdf (NN,x,y) -> x ^ y
  | Rdf (IN,x,y) -> x ^ y
  | Rdf (NI,x,y) -> x ^ y
  | Rdf (IL,x,y) -> x ^ y
  | Rdf (NL,x,y) -> x ^ y

The only difference with the conventional overloading is that a user has to provide a witness that shows that the overload is valid (or that the constructed polymorphic type is member of the set of types which we are ready to accept).

If we will revisit our design, we will find one thing that strikes immediately: why do we have to enumerate over all possible types, like (int,float) or (string, int list) while we actually are concerned only about three players: iri, literal, and node? So it looks like that we need an extra abstraction that will say that a type is an element of RDF. Moreover, we can actually claim that we have only three sorts (i.e., three non-intersecting sets) of elements: node, literal, and iri. This knowledge should also be captured in our model. Without further ado, here is how we can express our model in OCaml:

module type Rdf = sig
  type t
  type 'a elt

  val create : [< `iri | `node] elt -> [< `iri | `node | `literal] elt -> t

  module type S = sig
    type sort
    val create : string -> sort elt
    val show : sort elt -> string

  module Iri : S with type sort = [`iri]
  module Bnode : S with type sort = [`node]
  module Literal : S with type sort = [`literal]

Now let’s provide some stub implementation:

module Rdf : Rdf = struct 
 (* trust zone begins *)
  type 'a elt = string
  type t = string * string

  module type S = sig
    type sort
    val create : string -> sort elt
    val show : sort elt -> string

  module Stub(T : sig type t end) = struct
    type sort = T.t
    let create x = x
    let show x = x

  module Iri = Stub(struct type t = [`iri] end)
  module Bnode = Stub(struct type t = [`node] end)
  module Literal = Stub(struct type t = [`literal] end)

  let create x y = x,y

And now we can check how it works, provided an iri and literal

let hello = Rdf.Iri.create "hello"
let world = Rdf.Literal.create "world"

A user can use them in a correct order, e.g.,

let good = Rdf.create hello world

typechecks, while

let bad = Rdf.create world hello

complains with

This expression has type Rdf.Literal.sort Rdf.elt
       but an expression was expected of type
         ([< `iri | `node ] as 'a) Rdf.elt
       Type Rdf.Literal.sort = [ `literal ] is not compatible with type
         [< `iri | `node ] as 'a 
       These two variant types have no intersection

Alright, this introduces a fair load of new things for me, but I think I’m okay with that. Thanks a lot for the write-up!

I’ll try implementing that as soon as I have a moment, and I’ll let you know how it goes. :slight_smile:

I think I’m missing something important here: how do I connect this with my acutal implementation of Iri.t, Literal.t and Bnode.t?

For instance, you wrote

module Rdf : Rdf = struct 
  type 'a elt = string
  (* ... *)

But I can’t provide a real implementation of a triple term since it’s basically a sum type of IRIs, blank nodes and literals.

The more I look at this, the more confused I get.

By the way, you can take a look at the code here

Sorry, I didn’t notice your reply.

If it is a sum, then you have to implement some kind of a sum type, and since RDF is a product, then you need to use some kind of a product type in the implementation. Depending on your style or need you may employ different techniques for implementing sums and products of types, the most common would be to use Algebraic Data Types, possibly generalized (later on this), but you shall keep in mind that it is not the only way. You may also opt into abstract data types. For example, when you have a sum of two concepts, (lets pick int and float as an example) then you can represent their sum using the sum type:

type number = Int of int | Float of float

let add x y = match x,y with
  | Int x, Int y -> Int (x+y)
  | Float x, Float y -> Float (x +. y)
  | Float x, Int y -> Float (x +. float y)
  | Int x, Float y -> Float (float x +. y)

you may also use GADT to expose extra constraints, e.g., suppose we would like the add operation to be monomorphic, i.e., we would like to prevent from adding floats to ints and vice verse. You can’t express this with normal ADT, but GADT allows for it

type _ number = Int : int -> [`Z] number | Float : float -> [`R] number

let add : type s. s number -> s number -> s number =
  fun x y -> match x,y with
    | Int x, Int y -> Int (x + y)
    | Float x, Float y -> Float (x +. y)

Notice, that not only the typechecker won’t allow you to add a float to int, but it will also accept the pattern as exhaustive (and even more will disallow you to write any other cases, as they do not fit the constraint).

However, ADT is not the only solution, you may also use Abstract Data Types (unfortunately abbreviated also to ADT), again OCaml provides many techniques for implementing them starting from straightforward records:

type 'a number = {
  v : 'a;
  add : 'a -> 'a -> 'a

let float v = {v; add = (+.)}
let int v = {v; add = (+)}

let add {v=x; add} {v=y} = {v=add x y; add}

going to much more heavyweight first class modules (notice type annotations):

module type Number = sig
  type t
  val v : t
  val add : t -> t -> t

type 'a number = (module Number with type t = 'a)

let add : type t. t number -> t number -> t number =
  fun (module X : Number with type t = t)
    (module Y : Number with type t = t) ->
    (module struct
      include X
      let v = X.add X.v X.v

and finally going to object and classes:

class type ['a] number = object
  method value : 'a
  method add : 'a number -> 'a number -> 'a number

class r x : [float] number = object
  method value = x
  method add x y = new r (x#value +. y#value)

class z x : [int] number = object
  method value = x
  method add x y = new z (x#value + y#value)

Now, after this showcase, let’s get back to your example. But keep in mind, that I’m not trying to show you the way, but instead, I’m showing you different ways, so that you can build your own intution and pick the right way. In other words, I’m trying to tell you how to implement instead of what to implement. Anyway, back on track :slight_smile:

So you have a nice interface that is easy to use correctly and impossible to use incorrectly. Now the easy part - you need to implement it. When we choose between algebraic representation and abstract representation we have to consider our design constraints and fit them into the expression problem (described better here). If we will more likely add more operations then we have to stick with algebraic data types, if we more likely will add more entities, but the behavior is fixed, then we need to use abstract data types. If we are going to add both, we need to use the Tagless-final approach.

Fortunately, our case is simple, since we have a fixed set of objects, that we’re not going to extend, namely iri, node, and literal. Hence, we will stick to algebraic data types.

The next question is whether we should use plain ADT for our elt ty[e, and keep the type parameter phantom, or use constrained GADT. The first option is usually easy to implement, but you can’t rely on the type inference to check that your code is correct. Basically, the type system will admit whatever you will tell it about your phantom types. And since constraints are erased from ADT the exhaustiveness check will be overly pessimistic so you will be forced to spill quite a few assert falses around your code. It doesn’t sound like a lot of fun, but sometimes, this is the only way to implement the signature (GADT power is still limited), so let’s look a this implementation:

  type _ elt = (* type parameter is not used on rhs - thus it is a phantom *)
    | I of iri
    | N of node
    | L of literal

   (* we don't want the constraint to escape to the rdf type, so we may use 
      GADT  here to implement an existential. We can use first class modules,
      objects,  records with universal quantifiers, but GADT is the least 
      verbose here *)
  type t = Rdf : _ elt * _ elt -> t

   (* create is easy, but it loses the constraint that we've specified in 
      the signature *)
  let create x y = Rdf (x,y)

  (* hence the typechecker expects unexpectable *)   
  let show (Rdf (x,y)) = match x,y with
    | I x, I y -> x ^ y
    | N x, N y -> x ^ y
    | I x, N y -> x ^ y
    | N x, I y -> x ^ y
    | I x, L y -> x ^ y
    | N x, L y -> x ^ y
    (* nothing else is posible, but the typechecker doesn't know this *)
    | _ -> assert false 

Fortunately, we can use GADT to keep our constraints,

  type 'a elt =   (* the constraint is propagated to the branches *)
    | I : iri -> [`iri] elt
    | N : node -> [`node] elt
    | L : literal -> [`literal] elt

  (* and kept alive in our existential *) 
  type t = Rdf : [< `iri | `node] elt * [<`iri | `node | `literal] elt -> t

  (* create is still the same easy *)
  let create x y = Rdf (x,y)

  (* but show is now exhaustive *)
  let show (Rdf (x,y)) = match x,y with
    | I x, I y -> x ^ y
    | N x, N y -> x ^ y
    | I x, N y -> x ^ y
    | N x, I y -> x ^ y
    | I x, L y -> x ^ y
    | N x, L y -> x ^ y
    (* and we can't even add an impossible combination here *)

To summarize, OCaml is a very versatile language and provides many different techniques which you can employ to model your problem as tight as possible. Mastering all the techniques takes time, an investment that will be returned once you will start to apply them in the real world scenarios. A properly chosen abstraction will save you a lot of time in the software lifecycle.