Narrowing variant types alternatives

Consider the following type representing types in a programming language:

type t =
  | Name of string
  | Record of (string * t) list
  | Array of (t * int)

Is there a way to limit the type t in Array of (t * int) to be only a Name or Array? i.e. I would like the following to cause a compile error:

let _ = Array (Record []) in
(* ... *)

so that Array types can only be built from Name or Array constructors. Currently I find myself with a few special conditions during runtime that cannot possibly happen but that I need to check for exhaustiveness, because I don’t know how to encode a more strict representation.

I hope I explained myself, I’m a beginner to the language. Thanks.

1 Like

Yes, you can use GADT exactly for this purpose

type scalar = Scalar
type vector = Vector
type _ t = 
  | Name : string -> scalar t
  | Record : string * _ t -> vector t
  | Array : scalar t * int -> scalar t

Probably, I chose the wrong names for type constraints :slight_smile:

Another option would to use polymorphic variants, but this will require some restructuring of your original design. For an example, you can look into the C type system representation in BAP.


]I’ve never used 'em, but my understand was,] This is what private constructors were supposed to be for, right?

@ivg’s GADT suggestion is probably the most ergonomic one, but for completeness, here are two others, an ADT one, and one using polymorphic variants (as also suggested by @ivg).

module ADT = struct
  type not_record =
    | Name of string
    | List of t
    | Array of (not_record * int)
  and t =
    | Record of (string * t) list
    | Not_record of not_record

module Polymorphic_variants = struct
  type 'r not_record = [
    | `Name of string
    | `List of 'r (* any t is allowed *)
    | `Array of 'r not_record * int

  type 'r t_open = [
    | `Record of (string * 'r) list
    | 'r not_record

  type t = t t_open

  let _ : t = `List (`Record ["foo", `Array (`List (`Record []), 10)])

  (* Does not type-check:
  let _ : t = `List (`Record ["foo", `Array (`Record [], 10)]) *)

The disadvantage of the ADT solution is that you need the Not_record constructor, but smart constructors could help in this case.

The disadvantage of the polymorphic variant solution is that it only catches your invariant violation if the value is type-annotated.


Not really, they could be used as a part of a solution which is more practical than GADT.

What’s wrong with GADT?

The problem with GADT is that they mix two orthogonal concepts - types and data representation (aka data types). While ADT define data representation, constraints, attached to GADT constructors, are pure types, as their only purpose is to constrain the set to which the type belongs. On one hand it is very nice, we can kill two rabbits at once, construct data and prove its correctness by construction. But this comes with a price because you have to repeat the proof every time you reconstruct your data. In other words, you can’t serialize and deserialize your GADT, as the constraint itself is not serializable (because it is not data).

Therefore, in real life, where you occasionally need to pass your AST around different hosts, store them in cache and files, etc, using GADT immediately becomes a nuisance. As every time when you need to read your AST from a string, you have to repeat the process of proving by construction. In other words, you need to write a typed parser, that does parsing (data processing) and typing simultaneously. This is notoriously hard and, in fact, is only possible in a small number of simple cases. E.g., it is practically1 impossible to reconstruct an arrow type from its string representation,

    type 'a app = 
      | Unit : unit app
      | App : 'a * 'b app -> ('a -> 'b) app

Separate the concerns

Therefore, a practical solution would be to separate the concerns back and use ADT for data types, and module types for types. In other words, we will hide our data representation behind an abstraction wall. And we may even use the private keyword to allow others to peek into our privates through the holes in our abstraction wall.

The approach basically mimics the GADT approach, except that we move the constraints into the module type,

module Types : sig = 
  type scalar
  type vector
  type data = private
    | Name of string
    | Record of (string * data) list
    | Array of (data * int)

  type 'a t = data
  val name : string -> scalar t
  val record : (string * _ t) list -> vector t
  val array : scalar t -> int -> scalar t
end = struct 
  type scalar
  type vector

  type data =
    | Name of string
    | Record of (string * data) list
    | Array of (data * int)

   type 'a t = data

   let name x = Name x
   let record elts = Record (elts)
   let array elt sz = Array (elt,sz)

Now, our invariant is protected, so that nobody outside of the module could create an incorrect tree. However, inside the module, behind the wall, we are free to do whatever we want, since the constraint is now a phantom type, the compiler will allow us to ascribe any type to any tree. Of course, it will allow us to ascribe wrong types, so it is now our responsibility to implement the type checker correctly. But now we can read our data from text and give it back the type constraint. We can even reify the type constraint into a concrete data type representation and store it with AST.

In this solution we use the private keyword to expose some of our abstraction. So that a user could use a regular pattern match, however this highlights a small problem. When we turned away from GADT we lost an important feature, as beyond proving the data type is constructed correctly, GADT enables us to use this proof when we deconstruct the data. With our plain ADT representation, which is now impossible to construct incorrectly (using the Types interface), we still have to deal with an extra case, when we do

function Array (elt,n) -> match elt with
  | Name s -> name s
  | Array (elt',m) -> array elt m
  | Record _ -> assert false

Since we don’t have the constraints anymore, the compiler can’t refute the Record branch (and indeed, an incorrect code inside of the Types module can construct such representation, so we can’t blame the compiler for that.

And although assert false is more or less fine here, we don’t want a casual user to rely on our internal invariants (an even know about them because an invariant, that leaks the module is no longer an invariant). The reason why the invarinat escaped is because we exposed our internal representation, we leaked it through the abstraction via the private keyword. Therefore, the solution is to hide it,

module Types : sig = 
  type scalar
  type vector
  type _ t 

  val name : string -> scalar t
  val record : (string * _ t) list -> vector t
  val array : scalar t -> int -> scalar t

  val case : _ t -> 
     scalar:(scalar t -> 'a)
     vector:(vector t -> 'a) -> 'a

  val match_scalar : scalar t -> 
      name:(string -> 'a) ->
      array:(scalar t -> int -> 'a) -> 'a

  val match_vector : vector t ->
     record:((string * _ t) list -> 'a) -> 'a

end = struct 
  type scalar
  type vector

  type data =
    | Name of string
    | Record of (string * data) list
    | Array of (data * int)

   type 'a t = data

   let name x = Name x
   let record elts = Record (elts)
   let array elt sz = Array (elt,sz)

   let case t ~scalar ~vector = match t with
     | Record _ -> vector t
     | Array _ | Name _ -> scalar t

   let match_scalar t ~name ~array  = match t with
    | Array (elt,sz) -> array elt sz
    | Name s -> name s
    | _ -> assert false

  let match_vector t ~record = match t with 
    | Record elts -> record elts
    | _ -> assert false

We confined our invariant inside our module, and we can use assert false which could be triggered only if we bugged our code inside of the Types modules, which is now considered the trusted kernel.

The users of our module now can use the whole power of constraints, and basically, use our typed AST as it was GADTs but without any pain associated with them.

Final Solution, Tagless Final

Finally, we can notice, that cases in our matches totally mimic the types of our branch constructors and this smells like a generalization opportunity. Indeed, we can abstract our abstraction into the module type

(* keep constraints not abstract, and make them provably distinct in
    case if someone would like to use GADT in their representations.  *)

type scalar = private Scalar_constraint
type vector = private Vector_constraint

module type Types = sig
  type _ t

  val name : string -> scalar t
  val record : (string * _ t) list -> vector t
  val array : scalar t -> int -> scalar t

(* our old Types module is just _a_ representation, let's call
   it an AST *)
module Ast : sig
  include Types

  val case : 'a t ->
    scalar : (scalar t -> 'a) ->
    vector : (vector t -> 'a) -> 'a

  module Analysis(L : Types) : sig
    val scalar : scalar t -> scalar L.t
    val vector : vector t -> vector L.t
end = struct
  type data =
    | Name of string
    | Record of (string * data) list
    | Array of (data * int)

  type 'a t = data

  let name x = Name x
  let record elts = Record (elts)
  let array elt sz = Array (elt,sz)

  let case x ~scalar ~vector = match x with
    | Name _ | Array _ -> scalar x
    | Record _ -> vector x

  module Analysis(L : Types) = struct

    let rec scalar = function
      | Name s -> s
      | Array (elt,sz) -> L.array (scalar elt) sz
      | Record _ -> assert false

    let rec vector = function
      | Name _ | Array _ -> assert false
      | Record ((_, (Name _ | Array _)) :: _ as elts) ->
        L.record @@ (fun (name,fld) -> name, scalar fld) elts
      | Record elts ->
        L.record @@ (fun (name,fld) -> name, vector fld) elts

And now we can implement the analysis as simple as,

module Sizeof = Ast.Analysis(struct
    type 'a t = int
    let name _ = 4
    let array elt sz = elt * sz
    let record elts = List.fold_left (fun s (_,sz) -> s + sz) 0  elts

let sizeof t = t

This approach is called Taggles Final Style and is well-explored.

A few grains of salt

Although it all may look nice (or vice verse), there is a grain of salt. It doesn’t work as we have a problem in our original design, we dropped the type constraint of the record constructor. In other words, our record constructor, (string * 'a t) -> scalar t is non-injective, as it projects different types into the same type. Therefore we can’t reconstruct the constraint back. Therefore, probably a better type system representation would be

type array = array
type 'a tuple = Tuple

type 'a t
val array : array t -> int -> array t
val tuple : 'a t -> 'b tuple t -> ('a -> 'b) tuple t
val unit : unit tuple t

and the case analysis will look something like this

val case : _ t ->
  unit : (() -> 'a) ->
  tuple : (('a -> 'b) tuple t -> 'a) ->
  array : (array t -> 'a) -> 'a
val split : ('a -> 'b) tuple t -> 'a t * 'b tuple t

1) You can do this for a fixed number of constraints, e.g., int -> unit, int -> string -> unit, etc, not for arbitrary constraint. You would need a backtracking parser, which will try on each branch all possible cases. As a result, not only the performance will suffer, but the complexity of the parser itself.


First, thank you so much for the incredibly detailed answer. I have learnt a lot and I’m still trying to understand many of the ideas, thanks to you. I appreciate it a lot.

A few questions, based on what I (mis)understood so far:

  1. The tagless final paper 1 mentions removing interpretive overhead as one of the objectives of the technique: however in your example there’s still assert false, even if it’s hidden behind the abstraction whereas in their examples the impossible cases are non-existent and transformed into compile errors in the metalanguage?
  2. Aren’t scalar and vector type tags anyway?
  3. If compiling/interpreting a relatively lightweight language, is the performance gain still meaningful or is the pattern still worth over plain ADTs due to other reasons?
  4. Besides serialization issues, what other disadvantages exist with GADTs?
  5. The paper mentions “These techniques require just a Hindley-Milner type system (…)”: why is this strictly required? Would other type inference algorithms have trouble to express the same idea?

Thanks again!

Update: I tried to mimic what the paper does and came up with the following:

module type Types = sig
  type 'dv repr

  type basic
  type compound

  val name : string -> basic repr
  val arr : int -> basic repr -> basic repr
  val record : (string * basic repr) list -> compound repr
  val func : (string * basic repr) list -> basic repr -> compound repr

module Examples (T : Types) = struct
  open T

  let t1 = name "int"
  let t2 = arr 4 (name "float")
  let t3 = arr 4 (arr 8 (name "bool"))
  let t4 = record [("x", name "int")]
  (* Does not compile *)
  (* let t5 = arr 2 (record [("y", name "float")]) *)
  let t6 = func [("a", name "float")] (arr 2 (name "int"))
  (* Does not compile *)
  (* let t7 = func [("b", t6)] (name "void") *)

module Print = struct
  type 'dv repr = 'dv

  type basic = string
  type compound = string

  let name s = s
  let arr n t = Printf.sprintf "[%d]%s" n t
  let record fields =
    let fs = (fun (id, t) -> Printf.sprintf "%s: %s" id t) fields in
    Printf.sprintf "record {%s}" (String.concat "; " fs)
  let func args ret =
    let args = (fun (id, t) -> Printf.sprintf "%s: %s" id t) args in
    Printf.sprintf "func (%s): %s" (String.concat ", " args) ret

module Sizeof = struct
  type 'dv repr = 'dv

  type basic = int
  type compound = int

  let name _ = 1
  let arr n t = n * t
  let record fields = List.fold_left (fun acc (_, t) -> acc + t) 0 fields
  let func _ _ = 0

module EP = Examples (Print)
module ES = Examples (Sizeof)

let () =
  (* Print *)
  print_endline EP.t1 ;
  print_endline EP.t2 ;
  print_endline EP.t3 ;
  print_endline EP.t4 ;
  print_endline EP.t6 ;
  (* Sizeof *)
  print_endline (string_of_int ES.t1) ;
  print_endline (string_of_int ES.t2) ;
  print_endline (string_of_int ES.t3) ;
  print_endline (string_of_int ES.t4) ;
  print_endline (string_of_int ES.t6)

Would this be considered tagless final?


There are several different concepts involved here. First of all, the place, where I’m using the assert false statement, is not actually an example of the final tagless approach, it is sort of a counter-example. One of the mind challenging ideas of the true tagless final, is that there is no pattern matching and you have to (and as they prove even can) write analysis, including constant folding, in the final style. You example, is the proper showcase of the final style, and as you can see it doesn’t involve any overhead.

Second, even in the true final style, in general, you will still be having occasional assert falses here and there. This will happen when the signature is more constrained than the representation, and this will, nearly always, be the case (otherwise, there will be no reason to have the signature). Therefore, you will have to deal with cases that are constructible with your representation, though are not possible, since they are not included in the algebra that is defined by your signature. Those assert falses will not introduce any overhead, since they will be representing code which is not reachable. Of course, there is always a place for a bug. It could be that your signature is wrong and it in fact allows for construction ill-formed terms, or it could be an implementation bug, where a particular representation breaks an invariant. The former bugs, could be actually eliminated, if we will invoke the Curry–Howard isomorphism and provide an implementation for your signature S, that doesn’t use assert false or other diverging terms. The signature here denotes your theory (a set of theorems), while the implementation will be the proof, that neither theorem will produce an ill-formed (i.e., the assert false) term. For all, but the trivial signatures, you will end up using GADT. And if you haven’t been using diverging terms (which include non-terminating constructs, like general recursion), your structure would be the constructive proof that your signature is correct, for all possible implementations.

Finally, even in my example, those assert falses didn’t introduce any overhead, because they were unreachable. However, I would like to empathise that that example is not in the final form. It is one step before the final.

No. Here the word “tag” has a very concrete meaning. It refers to the implementation technique, as well as a general concept, called tagged union. In which the data interpretation is determined by a tag value (a number basically), which is stored in the runtime representation of data. For example, when you have

type figure = Rectangle of float * float | Circle of float

then every value of type figure will have a runtime tag which will discriminate whether it is a rectangle or a circle. Unlike, a tagless representation, where this information is erased, e.g., when an inheritance is used to implement ADT.

In short, tagless means that OCaml variant types (ADT, GADT, extensible, and polymorphic variants) are not used.

In our example scalar and vectors are not reified into concrete types and are erased during compilation (this is actually the reason why you can’t safely reconstruct them back from a serialized form, they do not have any runtime representation, hence there is no proof that they had existed in the typing environment).

With all that said, I would also like to point that the final tagless form doesn’t prohibit you from using the tagged form in the representation. So it doesn’t ban pattern matching or anything like this. It is the representation of the term (the syntax) which is tagless, while the interpretation of a term (the semantics) could be anything you like. And when you will investigate more complex analysis in the final form, you will find yourself using different ASTs for representing your optimization information.

I’m not really sure that this performance gain exists. OCaml variants are extremely fast, and the main benefit of the final tagless representation is not the performance (although it should not actually introduce any overhead). The main idea of the final tagless representation is that it provides a solution to the expression problem, i.e., it enables extensibility of your language in both aspects, you can add new terms (extend the syntax) and new interpretations (extend the semantics) without having to change existing implementations. Therefore, if your language is fixed (you are not expecting any new forms, branches in your language), then you can use plain ADT. Alternatively, if your language is expected to vary, but the interpretation is fixed, then using some kind of inheritance would be a solution (e.g., just representing every term as a function ctxt -> value will enable great extensibility of the syntax, but will render any analysis impossible – no extension of semantics).

The main disadvantage is that it is impossible to infer the most general type of GADT expression. In other words, you have to provide type annotations. Besides being syntax heavy and challenging in most but the trivial cases, it also renders unusable the now popular and heavily used part of OCaml infrastructure – ppx derivers. Basically, things like [@@deriving compare] and the friends will not work for GADT.
Also, the variance inference for GADT is incomplete in OCaml (it means that it is less precise than it could be), so when you’re substituting a covariant ADT type with a corresponding GADT to introduce extra constraints, you often (not always) loose the covariance property. Which, in turn, triggers value restriction and an unnecessary loss of polymorphism.

Otherwise, GADT are just plain tagged ADT with extra typing rules.

No. Here “just a” is crucial, and it means that the technique doesn’t require anything fancy, such as GADT or dependent types, just pure syntactical unification. In fact, it doesn’t even require global type inference, as it works in Scala, nor it requires type inference at all, therefore the pattern is applicable in C++ and Java, where it is known under a different name, Object Algebras.

Yep, perfectly tagless and final. Except that the following representation of types will conceptually have more sense

  type 'dv repr = int
  type basis
  type compound

Is this pattern considered more generally useful beyond embedding DSLs? e.g. is it reasonable to write a “traditional” optimizing compiler based on this pattern rather than using the deep embedding?