Trees that Grow: Extending and annotating the AST

The Haskell Trees that Grow (TTG) idiom can be used to provide different forms of extensions and variations on an AST, something that GHC is using actively.

How would I implement this in OCaml?

I’m thinking of using a list of polymorphic variants, e.g.

type annotation =
  [ `Location of int * int
  | `Type of ...
  | ...
  ]

then leaving an annotation field in each node and combining that with GADTs, somehow.

Is this the way?

TL;DR

Polymorphic variants yield weak typing disciplines and provide minimal extension capabilities w.r.t. the original Trees That Grow specification. These two issues can be solved through a combination of phantom tagged GADTs, module functors and recursive module definitions.


Earlier this year, I had a toy compiler project where I used this exact idiom to implement the AST. I used a mix of recursive modules and functors to achieve this, mainly for three reasons:

  1. The AST had MANY distinct components (think expressions, patterns, declarations, binds, signatures, …), and while most of them are mutually recursive, I wanted to keep the code bases for each component separate
  2. Although I’m not opposed to using polymorphic variants, I find they either introduce a lot of noise to the code base through extensive or duplicate type signatures, or reduce (or at least result in a weaker form of) type safety, which I find dangerous in a non-closed data structure such as an extensible tree.
  3. Due to the compiler pipeline design, I had multiple stages which used mostly the same AST components with minimal alterations. Much alike Haskell, I had three stages reusing the same base AST structure (parser, renamer, type checker), with each stage augmenting or completely turning off nodes from each component.

For these reasons, I choose to implement each component in separate modules as functors dependent on each extensible and/or recursive point, and then joining all of them at the top-level.


Extension points and pipeline stages

To handle the distinct n compiler stages, I had n phantom tag types which parameterize each AST component type, as well as each extension point type. Note that these types serve only as tags, and thus should not contain any data, nor should you be able to construct values of such types.

type stage_parser  = private | 
type stage_renamer = private | 
type stage_type    = private | 

Extension points are then modules with a single unary GADT, which indicates the types of metadata accepted by each node at each compiler pipeline stage. Since I had 3 stages, this type could be defined as

module Ext = sig
  type ps
  type rn
  type tc
  type 'a t = 
    | PData: ps -> stage_parser  t (* Parser data *)
    | RData: rn -> stage_renamer t (* Renamer data *)
    | TData: tc -> stage_type    t (* Type checking data *)

  (* Any constructor, destructor or manipulation functions defined for all extension points *)
end

To aid in creating these extension points, I used a functor parameterized by the metadata of each compiler stage

module MkExt = functor 
  (P: sig type t end)
  (R: sig type t end)
  (T: sig type t end) -> struct

  type ps = P.t
  type rn = R.t
  type tc = T.t
  type 'a t = 
    | PData: ps -> stage_parser  t (* Parser data *)
    | RData: rn -> stage_renamer t (* Renamer data *)
    | TData: tc -> stage_type    t (* Type checking data *)
end

With this extension design alone:

  1. “turn off” a node at some pipeline stage, by adding an extension point to that node where the metadata for that stage is set to the empty variant type type t = |
  2. add metadata to a node only at certain stages, by setting the metadata at other stages to the unit type
  3. Add special constructors, destructors, mappers and other functions for each pipeline stage.

An additional idea I did not end up implementing is the idea of combining metadata from multiple stages. For instance, location information could be created during parsing, but propagate to later stages. I believe this could be accomplished by a different functor, one which adds the parsing data as additional components to the later stages:

module PropagateExt = functor
  (P: sig type t end)
  (R: sig type t end)
  (T: sig type t end) -> MkExt
    (struct type t = P.t end)
    (struct type t = P.t * R.t end)
    (struct type t = P.t * T.t end)

An example

Take for instance, the following minimal AST types:

type ('id, 'op, 'con) expression = 
  | Var of 'id
  | Int of int
  | Float of float
  | Bool of bool
  | BinOp of 'op * ('id, 'op, 'con) expression * ('id, 'op, 'con) expression
  | If of ('id, 'op, 'con) expression * ('id, 'op, 'con) expression * ('id, 'op, 'con) expression
  | Let of ('id, 'op, 'con) bind list * ('id, 'op, 'con) expression
  | Lambda of ('id, 'con) pattern * ('id, 'op, 'con) expression
  | App of ('id, 'op, 'con) expression * ('id, 'op, 'con) expression

and ('id, 'con) pattern = 
  | Var of 'id
  | WildCard
  | Constructor of 'con * ('id, 'con) pattern list
  | As of 'id * ('id, 'con) pattern

and ('id, 'op, 'con) bind = 
  | PatBind of ('id, 'con) pattern * ('id, 'op, 'con) expression
  | FunBind of 'id * ('id, 'con) pattern list * ('id, 'op, 'con) expression

If you’re wondering why I chose these particular parameters for these components:

  • id: during parsing, you only know the name of a variable. However, after type checking, you can augment this with type, usage and constraint information, the last two being very useful if you are implementing some exotic type system.
  • con: follows the same idea of id. During parsing, you only know the constructor name, but in later stages, you may know its argument types, return type, the parent, type constructor, …, which you may want to directly inject in the AST itself
  • op: could serve two functions here. On one hand, it can be useful if some of the infix operators can be expressed as functions of other operators, and thus may be removed from later stages (as a very crude example, maybe you could optimize the power operator x ** n using only inlined multiplication x * x * ... * x). On the other hand, because infix operators are syntax sugar for binary function application, you can completely turn-off this constructor after type checking, by setting ‘op to the empty variant type.

This entire AST can be described in terms of mutual recursion points, defined as functor parameters. The components themselves become parameterized only by the compiler stage types. Each component and extension point can then reside in different .ml files, with recursion being handled at a top-level .ml file.

(* Extension.ml *)
module Id = MkExt
  (struct type t = string end)
  (struct type t = string end)
  (struct type t = string * Type.t * Multiplicity.t end)

module Con = MkExt
  (struct type t = string end)
  (struct type t = string end)
  (struct type t = string * Type.t list * Type.t * Tycon.t end)

module WC = MkExt
  (struct type t = unit end)
  (struct type t = unit end)
  (struct type t = Type.t end) (* Post type checking, any wildcard becomes aware of its type *)

module Op = MkExt
  (struct type t = unit end)
  (struct type t = unit end)
  (struct type t = | end) (* Any node using Op as an extension point becomes impossible post type checking *)
(* Expr_i.ml *)
open Extension

module type I = sig
  type _ bind

  type 'pass t = 
    | Var of 'pass Id.t
    | Int of int
    | Float of float
    | Bool of bool
    | BinOp of 'pass Op.t * 'pass t * 'pass t
    | If of 'pass t * 'pass t * 'pass t
    | Let of 'pass bind list * 'pass t
    | Lambda of 'pass Pattern.t * 'pass t
    | App of 'pass t * 'pass t
end

module Mk = functor
  (Bind: sig type 'pass t end) (* You can pass additional terms, but the minimal definition for recursion points is the type itself *) -> struct
    type 'pass t = 
    | Var of 'pass Id.t
    | Int of int
    | Float of float
    | Bool of bool
    | BinOp of 'pass Op.t * 'pass t * 'pass t
    | If of 'pass t * 'pass t * 'pass t
    | Let of 'pass Bind.t list * 'pass t
    | Lambda of 'pass Pattern.t * 'pass t
    | App of 'pass t * 'pass t
end
(* Pattern.ml *)
open Extension

type 'pass t = 
  | Var of 'pass Id.t
  | WildCard of 'pass WC.t
  | Constructor of 'pass Con.t * 'pass t list
  | As of 'pass Id.t * 'pass t
(* Bind_i.ml *)
open Extension

module type I = sig
  type _ expression

  type 'pass t = 
    | PatBind of 'pass Pattern.t * 'pass expression
    | FunBind of 'pass Id.t * 'pass Pattern.t list * 'pass expression
end

module Mk = functor
  (Expr: sig type 'pass t end) -> struct
  type 'pass t = 
    | PatBind of 'pass Pattern.t * 'pass Expr.t
    | FunBind of 'pass Id.t * 'pass Pattern.t list * 'pass Expr.t
end
(* AST.ml *)
module Pattern = Pattern
module rec Expression: Expr_i.I with type 'pass bind := 'pass Bind.t = Expr_i.Mk(Bind)
and Bind: Bind_i.I with type 'pass expression := 'pass Expression.t = Bind_i.Mk(Expression)

Now, this is still a lot of boilerplate code, well more than I had just used polymorphic variants. However, I find the code more well organized, more abstract (I suddenly don’t have to change n different variant types because I chose to add some property on some extension point) and more strongly typed. I also find it really useful to turn-off nodes at whim, organize metadata on a per-stage basis and add extension nodes to each component in a type-safe manner. Of course, you should evaluate the tradeoffs of such solutions. If your AST is smallish or a prototype, I would not advise going for such a complex solution. In such case, I would also advise not going for the Trees That Grow approach at all either way, but rather simply copying the AST will additional metadata or some nodes removed, for instance.

1 Like

I’m gonna have to reason through this…