"Parsing" terms into a well-typed representation: a GADT puzzle

Greetings everyone,

The haskell reddit had a GADT programming puzzle yesterday, and I wrote a solution in OCaml which I thought could be worth sharing. (For a primer on GADTs, see the tutorial in the manual.)

The problem

The author has a GADT that represents well-scoped, well-typed lambda-terms, which is nice as it captures the structure of the data very well.

type ('e, 'a) idx =
  | Zero : (('e * 'a), 'a) idx
  | Succ : ('e, 'a) idx -> ('e * 'b, 'a) idx

type ('e, 'a) texp =
  | Var : ('e, 'a) idx -> ('e, 'a) texp
  | Lam : (('e * 'a), 'b) texp -> ('e, 'a -> 'b) texp
  | App : ('e, 'a -> 'b) texp * ('e, 'a) texp -> ('e, 'b) texp

With this representation, 'e represents a type environment, and 'a represents the type of a term. For example, fun (x : 'a) -> fun (y : 'b) -> x would be written as Lam (Lam (Var (Succ Zero))), whose type is ('e, 'a -> 'b -> 'a) texp: in any environment 'e, this term has type 'a -> 'b -> 'a. The result of the function y is represented by Var (Succ Zero), of type (('e * 'a) * 'b, 'a) texp: in an environment that extends 'e with a first variable of type 'a and a second variable of type 'b, this term has type 'a. Zero means “the last variable introduced in the context”, and Succ Zero means “the variable before that”, so here the variable of type 'a. (This representation of variables by numbers, with 0 being the last variable, is standard in programming-language theory, it is called “De Bruijn indices”.)

Problem: we have seen how to express a fixed, well-typed term, but how could we turn an arbitrary term provided at runtime (say, as a s-expression or a parse-tree) into this highly-structured implementation?

Implementing a parser for lambda-terms is rather standard, but here we are trying to do the next step, to implement a “parser” from a standard AST to this well-typed GADT representation.

Suppose we start from the following representation, which may have been “parsed” from some input string from a standard parser:

type uty =
  | Unit
  | Arr of uty * uty

type uexp =
  | Var of int
  | Lam of uty * uexp
  | App of uexp * uexp

Can we write a function that converts an uexp (untyped expression) into a ('e, 'a) texp (typed expression) for some 'a?

If you want to take this post as a puzzle for yourself, feel free to stop here and try to solve the problem. In the next section I’m going to explain just the high-level details of my solution (types and type signatures), so you can still have fun implementing the actual functions. The post ends with my full code.

Sketch of a solution

Singleton datatypes

To “parse” an untyped expression into a well-typed GADT, we are in fact implementing a type-checker. We can think of implementing a type-checker without any GADT stuff: we need to traverse the type, maintain information about the typing environment, and sometimes check equalities between types (for the application form App(f, arg), the input type of f must be equal to the type of arg). Then the general idea is to do exactly the same thing, in a “type-enriched” way: our code needs to propagate type-level information to build our GADT at the same time.

For example, instead of an “untyped” representation of the environment, that would be basically uty list, we will use a GADT-representation of the environment, with the same runtime information but richer static types:

type 'a ty =
  | Unit : unit ty
  | Arr : 'a ty * 'b ty -> ('a -> 'b) ty

type 'e env =
  | Empty : unit env
  | Cons : 'e env * 'a ty -> ('e * 'a) env

Notice in particular how 'a ty gives a dynamic/runtime value that encodes the content of the type 'a. (I made the choice to restrict the language type system to a single base type, Unit; we could add more constants/primitive types, but embedding any OCaml type would be more difficult for reasons that will show up soon.)

Typeful equality checking

We cannot write OCaml code that checks, at runtime, whether 'a and 'b are the same type, but we can check whether two values of type 'a ty and 'b ty are equal. In fact, when they are, we can even get a proof (as a GADT) that 'a = 'b:

(* a value of type ('a, 'b) eq is a proof that ('a = 'b) *)
type (_, _) eq = Refl : ('a, 'a) eq

exception Clash
(* ensures that (a = b) or raises Clash *)
let rec eq_ty : type a b . a ty -> b ty -> (a, b) eq = ...

Existential packing

Our type-checking function will get a type environent 'e env and an untyped expression uexp, and it should produce some ('e, 'a) texp – or fail with an exception. But if the uexp is produced at runtime, we don’t know what its type 'a will be. To represent this, we use an “existential packing” of our type ('e, 'a) texp:

type 'e some_texp = Exp : 'a ty * ('e, 'a) texp -> 'e some_texp

The type 'e some_texp morally expresses exists 'a. ('e, 'a) texp; this is a standard GADT programming pattern.

Notice the 'a ty argument, which gives us a runtime witness/singleton for the type 'a: 'a is unknown, but we have a dynamic representation of it that we can use for printing, equality checking etc. (And our unknown 'a is restricted to the subset of types that can be valid parameters of 'a ty.) This is a standard extension of the standard pattern, which one may call “existential packing with dynamic witness”.

In fact, we will need “existential packings” of some other GADTs that
will be dynamically produced by our type-checker.

type some_ty = Ty : 'a ty -> some_ty
type 'e some_idx = Idx : 'a ty * ('e, 'a) idx -> 'e some_idx

We can “parse” an untyped uty value into a well-typed 'a ty value, in fact its existential counterpart some_ty:

let rec check_ty : uty -> some_ty = function
  ...

Type-checking functions

Given a type environment 'e env, we can “typecheck” an untyped variable (De Bruijn index) of type int into a well-typed representation ('e, 'a) idx for some unknown 'a determined at runtime.

exception Ill_scoped
let rec check_var : type e . e env -> int -> e some_idx =
  fun env n ->
    ...

If the integer n is out of bounds (negative or above the environment size), the function raises an Ill_scoped exception. It is not standard to use untyped exception for error-handling in this kind of programs, but extremely convenient – it lets us write let (Idx (ty, var)) = check_var env n in ..., instead of having to both with options, result or some other error monad. There is not enough information in our exceptions to provide decent error messages, but who needs decent error messages, right?

Finally we can write the main typechecking function for expressions:

exception Ill_typed
let rec check : type a e. e env -> uexp -> e some_texp =
  fun env exp ->
    ...

Example:

# check Empty (Lam (Unit, Lam (Unit, Var 1)));;
- : unit some_texp =
Exp (Arr (Unit, Arr (Unit, Unit)), Lam (Lam (Var (Succ Zero))))

Full code

(* well-typed representations *)
type ('e, 'a) idx =
  | Zero : (('e * 'a), 'a) idx
  | Succ : ('e, 'a) idx -> ('e * 'b, 'a) idx

type ('e, 'a) texp =
  | Var : ('e, 'a) idx -> ('e, 'a) texp
  | Lam : (('e * 'a), 'b) texp -> ('e, 'a -> 'b) texp
  | App : ('e, 'a -> 'b) texp * ('e, 'a) texp -> ('e, 'b) texp

let example = Lam (Lam (Var (Succ Zero)))

(* untyped representations *)
type uty =
  | Unit
  | Arr of uty * uty

type uexp =
  | Var of int
  | Lam of uty * uexp
  | App of uexp * uexp

(* singleton types to express type-checking *)
type 'a ty =
  | Unit : unit ty
  | Arr : 'a ty * 'b ty -> ('a -> 'b) ty

type 'e env =
  | Empty : unit env
  | Cons : 'e env * 'a ty -> ('e * 'a) env

(* existential types *)
type some_ty = Ty : 'a ty -> some_ty
type 'e some_idx = Idx : 'a ty * ('e, 'a) idx -> 'e some_idx
type 'e some_texp = Exp : 'a ty * ('e, 'a) texp -> 'e some_texp

(* dynamic type equality check *)
type (_, _) eq = Refl : ('a, 'a) eq
exception Clash
let rec eq_ty : type a b . a ty -> b ty -> (a, b) eq
    = fun ta tb -> match (ta, tb) with
    | (Unit, Unit) -> Refl
    | (Unit, Arr _) | (Arr _, Unit) -> raise Clash
    | (Arr (ta1, ta2), Arr (tb1, tb2)) ->
      let Refl = eq_ty ta1 tb1 in
      let Refl = eq_ty ta2 tb2 in
      Refl

(* "checking" a type (no failure) *)
let rec check_ty : uty -> some_ty = function
  | Unit -> Ty Unit
  | Arr (ta, tb) ->
    let (Ty ta) = check_ty ta in
    let (Ty tb) = check_ty tb in
    Ty (Arr (ta, tb))

(* "checking" a variable *)
exception Ill_scoped
let rec check_var : type e . e env -> int -> e some_idx =
  fun env n ->
    match env with
    | Empty -> raise Ill_scoped
    | Cons (env, ty) ->
      if n = 0 then Idx (ty, Zero)
      else
        let (Idx (tyn, idx)) = check_var env (n - 1) in
        Idx (tyn, Succ idx)

(* "checking" an input expression *)
exception Ill_typed
let rec check : type a e. e env -> uexp -> e some_texp =
  fun env exp ->
    match exp with
    | Var n ->
      let (Idx (ty, n)) = check_var env n in
      Exp (ty, Var n)
    | Lam (tya, exp') ->
      let (Ty tya) = check_ty tya in
      let (Exp (tyb, exp')) = check (Cons (env, tya)) exp' in
      Exp (Arr (tya, tyb), Lam exp')
    | App (exp_f, exp_arg) ->
      let (Exp (ty_f, exp_f)) = check env exp_f in
      let (Exp (ty_arg, exp_arg)) = check env exp_arg in
      begin match ty_f with
        | Unit -> raise Ill_typed
        | Arr (ty_arg', ty_res) ->
          let Refl = eq_ty ty_arg ty_arg' in
          Exp (ty_res, App (exp_f, exp_arg))
      end
18 Likes

I just would like to extend a bit your fantastic tutorial with a “toy” which does what you explain: GitHub - mirage/mirage-lambda: An eDSL for MirageOS apps It’s an unikernel which takes a lambda-calculus (via protobuf) and try to map it into a GADT - however, this lambda-calculus is much more complex than yours. Then, it executes the given GADT “safely” and returns the result. The most interesting part is this file:

Note that it’s an old toy (and I don’t have enough times to upgrade it - but if someone want, I will happy to merge their PRs). The initial idea was to take a lambda-calcul such as 1ml and map it to a GADT to finally use malfunction to emit OCaml bytecode and make my new best toy language.

However, I did not yet make a time machine to save my time and continue such interesting side-project :slight_smile: !

5 Likes

Thank you, @gasche, for this post! I found it very pleasant to read.

Surprisingly, while ('e, 'a) texp is guaranteed to be well-typed, the type information carried by uty is lost: I have then the feeling that converting back ('e, 'a) texp to uexp should perform some kind of type inference (which is guaranteed to succeed). Am I right?

I suspect that you are right, but that this complication is an oversight of the proposed presentation. I think that the well-typed version of Lam should also take an input-type annotation to make this much easier.

This is interesting, thank you for going through the whole thing. This relates to a problem I’ve had in the back of my head for awhile but haven’t gotten to the point of really trying to solve:

I have a GADT based SQL library where you annotate your SQL query and then get a function you can apply that is properly typed relative to your annotations. This is great for static queries but at some point I’d like to support taking a query from the user and generating safe SQL query and executing it (like a search bar in an app). So I want to go from text to a SQL query and then apply the query to that. I think this gives me some ideas to noodle on.

thanks for the awesome information.