Best way to structure an algebraic data type for ASTs?

Howdy! I’m a relative OCaml newcomer, and I’m writing a compiler in it.

My AST data type has the following interesting problem: there are some nodes in the AST that can appear under several sorts of circumstances but which are not valid in all of them.

For example, in this language, a variable definition can appear both at the top level and within a function, but a function definition cannot appear within a function.

However, currently, I have a single variant type for definitions, so the AST could encode an invalid situation where a function definition appeared inside a function definition.

My parser just returns a list of definitions at the top level, and a function definition contains a list of statements, some of which are definitions. This doesn’t allow the AST type to “know” that only certain kinds of definitions can validly appear inside a function.

Technically this isn’t an awful problem, my parser after all doesn’t generate the invalid cases. However, I’d prefer to have the type system itself assure the invalid case couldn’t be accidentally expressed. That is, I’d like to have the type system give stronger assurances that if a parse tree can be expressed, it is valid.

One ugly way to do this would be to break up the definition type into its constituent variants, create two new variants, one at the top level and one at the function level, that will only wrap the kinds of definitions allowed at that level, so one would only allow function definitions in the top level wrapper type. But this seems to be quite ugly looking in practice.

This is only one example, as the problem seems to have crept in several times.

Is there a good or even pretty pattern for doing such things within OCaml’s type system?

I hope this isn’t too obscure without my having given a detailed example from my code. If necessary, I can.

1 Like

Sounds like an interesting problem–when I worked on a compiler in OCaml my AST was not repetitive enough, I guess, for this to be an issue.

I do run into this sort of thing (if I’m understanding you correctly) with errors returned by functions. In that case I use polymorphic variants. Maybe for you this could look like this:

type top_level_definition = [
    `FuncDecl of id * (id, typ) list * typ option
  | definition
  | ...
]

type definition = [
    `VarDecl of id * typ * expression option
  | ...
]

This lets you avoid wrappers, at least. You can also re-use code that handles only definitions (maybe something that extracts used variables)–when you match, you will have to use the weird # syntax:

match (def : top_level_definition) with
| `FuncDecl _ -> ...
| (#definition as def) -> more_specialized_version def

Hope this helps.

Ah, but the problem with using polymorphic variants is I get even less typechecking. I can make mistakes unless I’m really careful about how I craft my code (rather than excluding them if I am thoughtful about how I craft my types). It would, however, allow me to re-use code, as you note, which would be an advantage. Really I would hope for some trick to let me get both the strong typechecking and avoid code re-use while maintaining elegance. I suppose I’m asking for too much, but I’m greedy. :slight_smile:

Hm, I really don’t know what you mean by getting even less typechecking when using polymorphic variants. You should be specifying the types that the various nonterminals are parsed to, in which case you get at least as strong typechecking with polymorphic variants as with regular variants.

Take the case where I do a default match against a polymorphic variant. The compiler can’t tell what the full set of legal variants under the circumstances might be. If I then have a typo in naming a variant somewhere, it won’t catch it. Or am I mistaken?

Oh, I see. Well, when you use regular variants you’re implicitly specifying the type of the parameter (because a particular variant name belongs to one type). With polymorphic variants you need to add the type annotation on the parameter explicitly. If you don’t, though, you’re right that you might accidentally allow unexpected variants.

For me, a refinement of your data-type could be better like:

type a =
  | ALambda of string list * a
  | AUnit
type b =
  | BLambda of string * b
  | BUnit

let rec reduce = function
  | ALambda (args, body) ->
    List.fold_right (fun arg acc -> BLambda (arg, acc)) args (reduce body)
  | AUnit -> BUnit

About polymorphic variant, you can check the exhaustivity in this way:

type intv = [ `Int of int ]
type boolv = [ `True | `False ]

let int_value (v : [ intv | boolv ]) = match v with
  | `Int v -> v
  | #boolv -> raise (Invalid_argument "Expected int value")

But, my best advise is to not use polymorphic variant in this case.

EDIT: note the type annotation in int_value is not required, OCaml can easily infer this. I put it only to precise the example.

Did you try with GADT ? Here is a toy example :

type 'a expr =
| Lit : int -> int expr
| Add : (int expr * int expr) -> int expr (* we can only add int expression *)
| IsZ : int expr -> bool expr (* we can only compare to zero an int expression *)
| If : (bool expr *'a expr * 'a expr) -> 'a expr (* the first argument must be a bool expression *)

(* so we can't write : if 1 then 2 else 3 *)
If (Lit 1, Lit 2, Lit 3);;
Error: This expression has type int expr but an expression was expected of type  bool expr
       Type int is not compatible with type bool

To write an evaluator for this AST read the OCaml manual to understand the type annotation :

let rec eval : type a. a expr -> a = function
| Lit i -> i
| Add (i,j) -> (eval i) + (eval j)
| IsZ i -> if eval i = 0 then true else false
| If (b,e,e') -> if eval b then eval e else eval e';;
val eval : 'a expr -> 'a = <fun>

eval (Lit 1);;
- : int = 1

(* if 1 = 0 then 2 else 3 *)
eval (If (IsZ (Lit 1), Lit 2, Lit 3));;
- : int = 3
3 Likes

I would recommend using this approach. In my experience it’s fine in practice.

type module =
  | Decl of decl
  | Fundecl of fundecl
  | ...
and body =
  | Decl of decl
  | ...
and decl = { var : var; def : expr }
and fundecl = { name : var; params : var list; body : body }

(If you are not comfortable with using type-directed constructor disambiguation, you can use Topdecl for the first Decl constructor to avoid the conflict.)

Polymorphic variants are a way to avoid the small redundancy here, but as you noted their flexibility comes at a cost: you have to carefully annotate your code with closed variant annotations in places to avoid mistakes returning unexpectedly-well-typed code.

GADTs also offer some flexibility, but they also have costs in practice (surprising error messages, etc.). I think it’s best to avoid complex features unless they are really necessary.

7 Likes

I’d like to have the type system give stronger assurances that if a parse tree can be expressed, it is valid.

This is really tough to do in the general case. It is possible to abuse GADTs to express properties such as well-typedness and well-scopedness at the type level, but for real languages this becomes rather ugly and hard to understand.

Unless you have a burning desire for type hackery, I would stick with simple approaches as in gasche’s suggestion. Where you rely on properties that are not convenient to express in OCaml’s type system, consider simply writing a predicate and asserting it.

1 Like

Of course, I forgot to mention it, but the best way to fix your problem is obviously to allow function declarations within function declarations :slight_smile:

3 Likes

You’re rigth. I forgot to read that @perry is a relative newcomer and only focus in the fact he finds the solution to split his definitions an “ugly way”. You’re solution will be easier to use in practice.

1 Like

Well for a newcomer @perry has a surprisingly elaborate view of the tradeoffs there and already a good feeling for polymorphic variants, so I wouldn’t worry about accidental harm to innocent bystanders. It’s more that for newcomers and experts alike I think that we should remember that simple is good.

(We often say that there are three stages with language features: first you know about them, then you learn how to use them, then you learn how not to use them.)

3 Likes

I’d like to have the type system give stronger assurances that if a parse tree can be expressed, it is valid.

This is really tough to do in the general case. It is possible to abuse GADTs to express properties such as well-typedness and well-scopedness at the type level, but for real languages this becomes rather ugly and hard to understand.

Oh, I don’t want to fully express that. That would be hard (and might even be impossible without dependent types.) I’d just like to have as much reasonable assurance as possible. Doing this fully would indeed require far more type hackery than I’m interested in.

What I don’t like about the “simple” approach, among other things, is that it requires that I write redundant code in a few places, but I think from a pragmatic point of view either that or the polymorphic variants approach with tight type annotations is going to be sanest. Of those two, the former is probably more to my taste but I may experiment with both just to get a feel for the styles.

I’m new to OCaml but not to programming, or even functional programming.

And thank you all for the advice, it is very much appreciated.

That would be hard (and might even be impossible without dependent types.)

It is possible, and may even make for a good exercise if you are interested in learning to program with GADTs. I don’t think it’s a great idea to do a compiler that way though.

1 Like

Can you expand on this? I’m interested in hearing about your experiences.

As exercises I’ve written toy compiler passes which use GADTs to express various well-formedness properties of their output. The result was that I learned a lot about how to program effectively with GADTs, which was nice.

Unfortunately the actual programming sucked. Working with clever GADTs involves paying close attention to rather tedious issues to do with encoding the information you want at the type level; verbose annotations, trivial wrapper datatypes, re-implementing basic data structures with the type-level information that you need added, etc. Error messages are awful and the code becomes both harder to read and less valuable in terms of understanding the problem being solved when you do read it.

Since writing a compiler is enough fun already, I would use more straightforward methods.

1 Like

You can take a look to the tagless final method which can give you a different way to structure your type.