An AST typing problem

I have been working on xobl, a “protocol compiler” for X11 which essentially reads the XML to an AST and applies a series of transformations to infer typings information, so that the backend(s) can use it to produce somewhat idiomatic X11 bindings.

It already kind of works, only missing some more passes and the cruft needed to establish a connection, but I’ve been at it for almost two years now and by now I feel drained just thinking about working on the damn thing; I’ve rewritten the middle-end (the AST transformations) part a few times and I’m never quite satisfied with it.

There’s a lot of passes, many of which depend on the previous ones, each one making some slight change to the AST which might or might not result in having to walk through the whole AST to catch all occurrences of that particular node. Clearly you’ll want to encode semantic errors in the types, so each pass ends up having its own unique AST, each depending on the previous one. To change a single node deep in the AST I have to write about a hundred lines of types and mapping functions’ worth of boilerplate. Any change in the lower levels of the AST bubbles up to the higher ones, and refactoring becomes a nightmare.

I think my method has some strengths, but it’s way too brittle and susceptible to change for my liking.

This problem has already been discussed multiple times in the ML community at large, but I was wondering what methods might work specifically for OCaml and this particular example.

I’ve been thinking about making the middle-end mostly “untyped”, and restoring the encoding of semantic errors in the types only in the final pass just before it’s handed over to the backend. This would allow me to write the mapping functions once and reusing them for every pass, in a similar way to Ast_mapper from the OCaml compiler, sacrificing some type safety.

The other solution I thought of would be to make most types generic, so that each pass would still have its own AST, but I’d still be able to reuse the mapping functions from the previous passes, but I’m afraid that would quickly become a mess of its own.

What do you think?

4 Likes

I’m a bit disappointed no one has chimed in on this; the question is of interest to anyone building a compiler based on micropasses.

1 Like

By the way, I forgot to mention that a nanopass framework does exist for OCaml (though I’m not sure it would help much with my issue).

1 Like

I’ve been thinking off-and-on about your question, which is a really good one. Some thoughts:

(0) I’ve been writing a compiler also, and at this point, it’s up to 2 passes, soon to have 3, and I expect 5. So yeah, I’m thinking about this subject “in anger”.

(1) [I don’t believe you’re wanting this, but I’ll mention it, b/c it was referenced] IIRC one of the lambda-the-ultimate threads mentioned wanting to support: (a) create AST, (b) run some passes, © then rewrite the AST, and (d) expect somewhat automatic re-generation of the attributes that had been computed by those passes in step (b).

It seems like that is asking too much: there’s a large literature on attribute-grammars, and incremental re-attribution after edits (Teitelbaum, Horowitz, Reps, et al over decades). And the basic thing is, if you’re going to do this, you’re going to import the entire machinery of attribute-grammars, and I suspect that that’s far more than you (or I) would want.

(2) Would it be asking too much, if I were to ask that you point at a few examples in your xobl code, that the commentariat could look at as examples? Just pointers, no commentary. Just to fast-forward the discussion? [b/c, well, like I said, I’m in the middle of building a compiler, and don’t have a lot of time to dig thru source code looking for what appear to be likely examples.]

I -do- think this a valuable question on which to move forward the state of knowledge.

I often use a single AST and static capabilities via typing.

For example, I’ll have

module Lang = struct
  type t = 
    | BinOp of op * t * t
    ...
end

module type DeFunI = sig
  type t

  val inject : Lang.t -> t
  val extract : t -> Lang.t
end

module DeFun : DeFunI = struct
  type t = Lang.t

  let inject e = (* defunctionalization pass *)
  let extract e = e
end

module NextPass : NextPassI = struct
  type t = Lang.t

  let inject (e : DeFun.t) = (* next pass *)
  let extract e = e
end

And you can use private types if you want to avoid having to use extract. Not sure if this helps, but its my preferred method to avoid defining a ton of ASTs.

This will allow you to ensure that passes are run in the correct order, but won’t save you from writing your injector incorrectly. I.e. there will be invalid values which are still representable.

EDIT: Of course this only helps if you are doing passes that do desugaring / elimination of particular constructs (like defunctionalization / closure elimination).

EDIT2: I guess I’m a little confused about the original post. The AST typing problem seems to refer to how to annotate an AST with types, but the OP seems to be wondering about how to write compiler passes with minimal boilerplate.

That’s a nice solution. Currently I’m doing something similar using a functor to define passes but I’m redefining the whole AST for each pass; using a looser AST but keeping a similar approach could definitely work for me.

You got me there. I guess I was a little confused too, now that I think about they’re two different issues. Still, resolving types for all the AST nodes is part of what I’m doing in the compiler and I’m interested in solutions to both!

I haven’t looked too closely at attribute grammars honestly, but yeah they look a bit too heavy for this.

Sure, most of the incriminating code is in the elaborate directory. The passes are in the modules that start with p, and each one (except for the first) defines a map function that maps the top-level declarations from one pass to the next. The Pass.Make functor takes care of turning that into a module with a nicer API.

There’s only two “serious” passes because… frankly I got tired of writing all that boilerplate.

This note discusses the beginnings of an OCaml attribute-grammar
evaluator generator. You can find this code on github at camlp5/pa_ppx_ag.

All of this code is implemented using camlp5 and the pa_ppx suite
of PPX rewriters.

Caveat: this code is less than a week old, so it’s changing fast. In
the unlkely event that anybody out there is actually interested in
using this code, I’m happy to help in any way I can. But just be
aware that it’s changing -really- fast.

Attribute Grammars for the multipass AST analysis problem

A year-and-a-half ago, the OP “An AST Typing Problem”
(An AST typing problem) raised the
problem of how to deal with ASTs, in the presence of multiple passes
of program-analysis, each of which will want to hang various bits of
data off nodes. The author of the OP pointed also at a couple of
posts on Lambda-the-Ultimate (LtU), discussing related problems.

The author notes:

There’s a lot of passes, many of which depend on the previous ones,
each one making some slight change to the AST which might or might
not result in having to walk through the whole AST to catch all
occurrences of that particular node. Clearly you’ll want to encode
semantic errors in the types, so each pass ends up having its own
unique AST, each depending on the previous one. To change a single
node deep in the AST I have to write about a hundred lines of types
and mapping functions’ worth of boilerplate. Any change in the
lower levels of the AST bubbles up to the higher ones, and
refactoring becomes a nightmare.

I’ve been thinking about this problem ever since, and at the time, had
suggested that while it seemed like attribute-grammars might be a
workable solution, they were a pretty heavy hammer. It doesn’t help
(of course) that there exist no attribute-grammar evaluator
generators, for OCaml. Also, at least in the LtU threads, there was
discussion of modifying the AST, and having the analyses automatically
be updated for the modified AST. Obviously this would require an
incremental re-attribution algorithm: more complexity and again,
something that isn’t implemented for OCaml.

But imagine that there existed an attribute-grammar evaluator
generator for OCaml. So for a simple language of expressions, with an assignment-operator,
we could write an evaluator as an attribute-grammar.
Imagine that you could write an ast like this
(test1_ast.ml):

type expr =
    INT of int
  | BINOP of binop * expr * expr
  | UNOP of unop * expr
  | REF of string
  | ASSIGN of string * expr
  | SEQ of expr * expr
and unop = UPLUS | UMINUS
and binop = PLUS | MINUS | STAR | SLASH | PERCENT
and prog = expr

and then (having elsewhere written parser/pretty-printer) declare
attributes on those types (test1_variants.ml):

module Attributed = struct
  [%%import: Test1_ast.expr]
  [@@deriving attributed {
    attributed_module_name = AT
  ; normal_module_name = OK
  ; attributes = {
      expr = {
        inh_env = [%typ: (string * int) list]
      ; syn_env = [%typ: (string * int) list]
      ; value_ = [%typ: int]
      }
    ; prog = {
        value_ = [%typ: int]
      }
    ; binop = {
        oper = [%typ: int -> int -> int]
      }
    ; unop = {
        oper = [%typ: int -> int]
      }
    }
  }]
end

and then declare attribute equations (test1_ag.ml):

module REC = struct
[%%import: Test1_variants.Attributed.AT.expr]
  [@@deriving ag {
    module_name = AG
  ; storage_mode = Records
  ; axiom = prog
  ; attributes = {
      expr = {
        inh_env = [%typ: (string * int) list]
      ; syn_env = [%typ: (string * int) list]
      ; value_ = [%typ: int]
      }
    ; prog = {
        value_ = [%typ: int]
      }
    ; binop = {
        oper = [%typ: int -> int -> int]
      }
    ; unop = {
        oper = [%typ: int -> int]
      }
    }
  ; attribution = {
      expr__INT = (
        [%nterm 0].syn_env := [%nterm 0].inh_env ;
        [%nterm 0].value_ := [%prim 1].intval
      )
    ; expr__BINOP = (
        [%nterm expr.(1)].inh_env := [%nterm expr].inh_env ;
        [%nterm expr.(2)].inh_env := [%nterm expr.(1)].syn_env ;
        [%nterm expr].syn_env := [%nterm expr.(2)].syn_env ;
        [%nterm expr].value_ := [%nterm binop.(1)].oper [%nterm expr.(1)].value_ [%nterm expr.(2)].value_
      )
    ; expr__UNOP = (
        [%nterm expr.(1)].inh_env := [%nterm expr].inh_env ;
        [%nterm expr].syn_env := [%nterm expr.(1)].syn_env ;
        [%nterm expr].value_ := [%nterm unop.(1)].oper [%nterm expr.(1)].value_
      )
    ; expr__REF = (
        [%nterm 0].syn_env := [%nterm 0].inh_env ;
        [%nterm 0].value_ := List.assoc [%prim 1].stringval [%nterm 0].inh_env
      )
    ; expr__ASSIGN = (
        [%nterm 0].syn_env := ([%prim 1].stringval, [%nterm expr.(1)].value_) :: [%nterm expr.(1)].syn_env ;
        [%nterm expr.(1)].inh_env := [%nterm 0].inh_env ;
        [%nterm 0].value_ := [%nterm expr.(1)].value_
      )
    ; expr__SEQ = (
        [%nterm 1].inh_env := [%nterm 0].inh_env ;
        [%nterm 2].inh_env := [%nterm 1].syn_env ;
        [%nterm 0].syn_env := [%nterm 2].syn_env ;
        [%nterm 0].value_ := [%nterm 2].value_
      )
    ; prog = (
        [%nterm 1].inh_env := [] ;
        [%nterm 0].value_ := [%nterm 1].value_ ;
        assert True
      )
    ; unop__UPLUS = (
        [%nterm unop].oper := fun x -> x
      )
    ; unop__UMINUS = (
        [%nterm unop].oper := fun x -> (- x)
      )
    ; binop__PLUS = (
        [%nterm binop].oper := (+)
      )
    ; binop__MINUS = (
        [%nterm binop].oper := (-)
      )
    ; binop__STAR = (
        [%nterm binop].oper := fun a b -> a*b
      )
    ; binop__SLASH = (
        [%nterm binop].oper := (/)
      )
    ; binop__PERCENT = (
        [%nterm binop].oper := (mod)
      )
    }
  }]
end

and then, turning a crank, you would get an evaluator:

let test_records ctxt =
  assert_equal 3 ({| x := 1 ; x ; y := 2 ; x + y |} |> pa_prog_attributed |> REC.AG.evaluate)
; assert_equal 0 ({| x := 1 ; y := 2 ; x / y |} |> pa_prog_attributed |> REC.AG.evaluate)

where pa_prog_attributed is a parser that parses the surface syntax
into an AST, which has empty slots for all attributes, and
REC.AG.evaluate evaluates attributes in its argument AST, and then
returns a tuple of all the synthesized attributes of the root node.

Retaining familiar surface syntax for pattern-matching and constructing ASTs

Now, we don’t want to give up easy pattern-matching and construction
of the AST, just because the AST has attributes strewn throughout it.
But we don’t have to: with Camlp5’s “quotations”, once we define a
surface syntax parser for the basic AST (unadorned with attributes –
viz. test1_ast.ml), we can use that to bootstrap ourselves to a
surface syntax parser for expressions and patterns over that AST, and
then in a similar manner we can get them for the AST adorned with
attributes.

This has already been done for hashconsed ASTs, and ASTs with built-in
unique-IDs, and and doing it for “attributed ASTs” isn’t any harder.
Those examples can be found in the github project
camlp5/pa_ppx_q_ast.

Limitations

There are still limitations.

  1. The current code only implements topological-order evaluation.
    That is, it builds the entire dependency-graph, topologically-sorts
    it, and then evaluates attributes. This is … suboptimal, when
    we well know that almost all interesting AGs are already in the
    class of ordered attribute-grammars (OAGs). I plan to implement
    the OAG evaluation strategy next.

  2. Traditionally AGs are defined over “productions” which are
    sequences of nonterminals and terminals. This doesn’t correspond
    to the way we define OCaml constructor data-types. So instead of a constructor like

type expr =
  ... | Call of name * arg_list
and arg_list = NoArgs | SomeArgs of expr * arg_list

we might want to use 'a list

type expr =
  ... | Call of name * expr list

Problem is: defining attribute-equations for (effectively) an array of
nodes, is not part of the standard lingo of AGs. But I believe we can
invent new syntax and make this succinct.

  1. Storage optimization. A naive implementation of AGs can store all
    attributes ever computed, at all the nodes in the AST. This can
    use a lot of memory. But there are well-known techniques to
    discard attributes once they’ll never more be needed in the rest of
    the attribute-evaluation, and I plan to implement these techniques.

There’s an entire literature on things like remote-references in
attribute grammars, aggregates, and other things, all of which can
probably be usefully employed.

Conclusion

I think that attribute-grammars could be a useful way to structure
complex multipass program-analysis, just as they used to do back in
the good ol’ days.

Maybe worth a look-see!

4 Likes