An AST typing problem

#1

I have been working on xobl, a “protocol compiler” for X11 which essentially reads the XML to an AST and applies a series of transformations to infer typings information, so that the backend(s) can use it to produce somewhat idiomatic X11 bindings.

It already kind of works, only missing some more passes and the cruft needed to establish a connection, but I’ve been at it for almost two years now and by now I feel drained just thinking about working on the damn thing; I’ve rewritten the middle-end (the AST transformations) part a few times and I’m never quite satisfied with it.

There’s a lot of passes, many of which depend on the previous ones, each one making some slight change to the AST which might or might not result in having to walk through the whole AST to catch all occurrences of that particular node. Clearly you’ll want to encode semantic errors in the types, so each pass ends up having its own unique AST, each depending on the previous one. To change a single node deep in the AST I have to write about a hundred lines of types and mapping functions’ worth of boilerplate. Any change in the lower levels of the AST bubbles up to the higher ones, and refactoring becomes a nightmare.

I think my method has some strengths, but it’s way too brittle and susceptible to change for my liking.

This problem has already been discussed multiple times in the ML community at large, but I was wondering what methods might work specifically for OCaml and this particular example.

I’ve been thinking about making the middle-end mostly “untyped”, and restoring the encoding of semantic errors in the types only in the final pass just before it’s handed over to the backend. This would allow me to write the mapping functions once and reusing them for every pass, in a similar way to Ast_mapper from the OCaml compiler, sacrificing some type safety.

The other solution I thought of would be to make most types generic, so that each pass would still have its own AST, but I’d still be able to reuse the mapping functions from the previous passes, but I’m afraid that would quickly become a mess of its own.

What do you think?

4 Likes
#2

I’m a bit disappointed no one has chimed in on this; the question is of interest to anyone building a compiler based on micropasses.

1 Like
#3

By the way, I forgot to mention that a nanopass framework does exist for OCaml (though I’m not sure it would help much with my issue).

1 Like
#4

I’ve been thinking off-and-on about your question, which is a really good one. Some thoughts:

(0) I’ve been writing a compiler also, and at this point, it’s up to 2 passes, soon to have 3, and I expect 5. So yeah, I’m thinking about this subject “in anger”.

(1) [I don’t believe you’re wanting this, but I’ll mention it, b/c it was referenced] IIRC one of the lambda-the-ultimate threads mentioned wanting to support: (a) create AST, (b) run some passes, © then rewrite the AST, and (d) expect somewhat automatic re-generation of the attributes that had been computed by those passes in step (b).

It seems like that is asking too much: there’s a large literature on attribute-grammars, and incremental re-attribution after edits (Teitelbaum, Horowitz, Reps, et al over decades). And the basic thing is, if you’re going to do this, you’re going to import the entire machinery of attribute-grammars, and I suspect that that’s far more than you (or I) would want.

(2) Would it be asking too much, if I were to ask that you point at a few examples in your xobl code, that the commentariat could look at as examples? Just pointers, no commentary. Just to fast-forward the discussion? [b/c, well, like I said, I’m in the middle of building a compiler, and don’t have a lot of time to dig thru source code looking for what appear to be likely examples.]

I -do- think this a valuable question on which to move forward the state of knowledge.

#5

I often use a single AST and static capabilities via typing.

For example, I’ll have

module Lang = struct
  type t = 
    | BinOp of op * t * t
    ...
end

module type DeFunI = sig
  type t

  val inject : Lang.t -> t
  val extract : t -> Lang.t
end

module DeFun : DeFunI = struct
  type t = Lang.t

  let inject e = (* defunctionalization pass *)
  let extract e = e
end

module NextPass : NextPassI = struct
  type t = Lang.t

  let inject (e : DeFun.t) = (* next pass *)
  let extract e = e
end

And you can use private types if you want to avoid having to use extract. Not sure if this helps, but its my preferred method to avoid defining a ton of ASTs.

This will allow you to ensure that passes are run in the correct order, but won’t save you from writing your injector incorrectly. I.e. there will be invalid values which are still representable.

EDIT: Of course this only helps if you are doing passes that do desugaring / elimination of particular constructs (like defunctionalization / closure elimination).

EDIT2: I guess I’m a little confused about the original post. The AST typing problem seems to refer to how to annotate an AST with types, but the OP seems to be wondering about how to write compiler passes with minimal boilerplate.

#6

That’s a nice solution. Currently I’m doing something similar using a functor to define passes but I’m redefining the whole AST for each pass; using a looser AST but keeping a similar approach could definitely work for me.

You got me there. I guess I was a little confused too, now that I think about they’re two different issues. Still, resolving types for all the AST nodes is part of what I’m doing in the compiler and I’m interested in solutions to both!

#7

I haven’t looked too closely at attribute grammars honestly, but yeah they look a bit too heavy for this.

Sure, most of the incriminating code is in the elaborate directory. The passes are in the modules that start with p, and each one (except for the first) defines a map function that maps the top-level declarations from one pass to the next. The Pass.Make functor takes care of turning that into a module with a nicer API.

There’s only two “serious” passes because… frankly I got tired of writing all that boilerplate.