An AST typing problem

This note discusses the beginnings of an OCaml attribute-grammar
evaluator generator. You can find this code on github at camlp5/pa_ppx_ag.

All of this code is implemented using camlp5 and the pa_ppx suite
of PPX rewriters.

Caveat: this code is less than a week old, so it’s changing fast. In
the unlkely event that anybody out there is actually interested in
using this code, I’m happy to help in any way I can. But just be
aware that it’s changing -really- fast.

Attribute Grammars for the multipass AST analysis problem

A year-and-a-half ago, the OP “An AST Typing Problem”
(An AST typing problem) raised the
problem of how to deal with ASTs, in the presence of multiple passes
of program-analysis, each of which will want to hang various bits of
data off nodes. The author of the OP pointed also at a couple of
posts on Lambda-the-Ultimate (LtU), discussing related problems.

The author notes:

There’s a lot of passes, many of which depend on the previous ones,
each one making some slight change to the AST which might or might
not result in having to walk through the whole AST to catch all
occurrences of that particular node. Clearly you’ll want to encode
semantic errors in the types, so each pass ends up having its own
unique AST, each depending on the previous one. To change a single
node deep in the AST I have to write about a hundred lines of types
and mapping functions’ worth of boilerplate. Any change in the
lower levels of the AST bubbles up to the higher ones, and
refactoring becomes a nightmare.

I’ve been thinking about this problem ever since, and at the time, had
suggested that while it seemed like attribute-grammars might be a
workable solution, they were a pretty heavy hammer. It doesn’t help
(of course) that there exist no attribute-grammar evaluator
generators, for OCaml. Also, at least in the LtU threads, there was
discussion of modifying the AST, and having the analyses automatically
be updated for the modified AST. Obviously this would require an
incremental re-attribution algorithm: more complexity and again,
something that isn’t implemented for OCaml.

But imagine that there existed an attribute-grammar evaluator
generator for OCaml. So for a simple language of expressions, with an assignment-operator,
we could write an evaluator as an attribute-grammar.
Imagine that you could write an ast like this
(test1_ast.ml):

type expr =
    INT of int
  | BINOP of binop * expr * expr
  | UNOP of unop * expr
  | REF of string
  | ASSIGN of string * expr
  | SEQ of expr * expr
and unop = UPLUS | UMINUS
and binop = PLUS | MINUS | STAR | SLASH | PERCENT
and prog = expr

and then (having elsewhere written parser/pretty-printer) declare
attributes on those types (test1_variants.ml):

module Attributed = struct
  [%%import: Test1_ast.expr]
  [@@deriving attributed {
    attributed_module_name = AT
  ; normal_module_name = OK
  ; attributes = {
      expr = {
        inh_env = [%typ: (string * int) list]
      ; syn_env = [%typ: (string * int) list]
      ; value_ = [%typ: int]
      }
    ; prog = {
        value_ = [%typ: int]
      }
    ; binop = {
        oper = [%typ: int -> int -> int]
      }
    ; unop = {
        oper = [%typ: int -> int]
      }
    }
  }]
end

and then declare attribute equations (test1_ag.ml):

module REC = struct
[%%import: Test1_variants.Attributed.AT.expr]
  [@@deriving ag {
    module_name = AG
  ; storage_mode = Records
  ; axiom = prog
  ; attributes = {
      expr = {
        inh_env = [%typ: (string * int) list]
      ; syn_env = [%typ: (string * int) list]
      ; value_ = [%typ: int]
      }
    ; prog = {
        value_ = [%typ: int]
      }
    ; binop = {
        oper = [%typ: int -> int -> int]
      }
    ; unop = {
        oper = [%typ: int -> int]
      }
    }
  ; attribution = {
      expr__INT = (
        [%nterm 0].syn_env := [%nterm 0].inh_env ;
        [%nterm 0].value_ := [%prim 1].intval
      )
    ; expr__BINOP = (
        [%nterm expr.(1)].inh_env := [%nterm expr].inh_env ;
        [%nterm expr.(2)].inh_env := [%nterm expr.(1)].syn_env ;
        [%nterm expr].syn_env := [%nterm expr.(2)].syn_env ;
        [%nterm expr].value_ := [%nterm binop.(1)].oper [%nterm expr.(1)].value_ [%nterm expr.(2)].value_
      )
    ; expr__UNOP = (
        [%nterm expr.(1)].inh_env := [%nterm expr].inh_env ;
        [%nterm expr].syn_env := [%nterm expr.(1)].syn_env ;
        [%nterm expr].value_ := [%nterm unop.(1)].oper [%nterm expr.(1)].value_
      )
    ; expr__REF = (
        [%nterm 0].syn_env := [%nterm 0].inh_env ;
        [%nterm 0].value_ := List.assoc [%prim 1].stringval [%nterm 0].inh_env
      )
    ; expr__ASSIGN = (
        [%nterm 0].syn_env := ([%prim 1].stringval, [%nterm expr.(1)].value_) :: [%nterm expr.(1)].syn_env ;
        [%nterm expr.(1)].inh_env := [%nterm 0].inh_env ;
        [%nterm 0].value_ := [%nterm expr.(1)].value_
      )
    ; expr__SEQ = (
        [%nterm 1].inh_env := [%nterm 0].inh_env ;
        [%nterm 2].inh_env := [%nterm 1].syn_env ;
        [%nterm 0].syn_env := [%nterm 2].syn_env ;
        [%nterm 0].value_ := [%nterm 2].value_
      )
    ; prog = (
        [%nterm 1].inh_env := [] ;
        [%nterm 0].value_ := [%nterm 1].value_ ;
        assert True
      )
    ; unop__UPLUS = (
        [%nterm unop].oper := fun x -> x
      )
    ; unop__UMINUS = (
        [%nterm unop].oper := fun x -> (- x)
      )
    ; binop__PLUS = (
        [%nterm binop].oper := (+)
      )
    ; binop__MINUS = (
        [%nterm binop].oper := (-)
      )
    ; binop__STAR = (
        [%nterm binop].oper := fun a b -> a*b
      )
    ; binop__SLASH = (
        [%nterm binop].oper := (/)
      )
    ; binop__PERCENT = (
        [%nterm binop].oper := (mod)
      )
    }
  }]
end

and then, turning a crank, you would get an evaluator:

let test_records ctxt =
  assert_equal 3 ({| x := 1 ; x ; y := 2 ; x + y |} |> pa_prog_attributed |> REC.AG.evaluate)
; assert_equal 0 ({| x := 1 ; y := 2 ; x / y |} |> pa_prog_attributed |> REC.AG.evaluate)

where pa_prog_attributed is a parser that parses the surface syntax
into an AST, which has empty slots for all attributes, and
REC.AG.evaluate evaluates attributes in its argument AST, and then
returns a tuple of all the synthesized attributes of the root node.

Retaining familiar surface syntax for pattern-matching and constructing ASTs

Now, we don’t want to give up easy pattern-matching and construction
of the AST, just because the AST has attributes strewn throughout it.
But we don’t have to: with Camlp5’s “quotations”, once we define a
surface syntax parser for the basic AST (unadorned with attributes –
viz. test1_ast.ml), we can use that to bootstrap ourselves to a
surface syntax parser for expressions and patterns over that AST, and
then in a similar manner we can get them for the AST adorned with
attributes.

This has already been done for hashconsed ASTs, and ASTs with built-in
unique-IDs, and and doing it for “attributed ASTs” isn’t any harder.
Those examples can be found in the github project
camlp5/pa_ppx_q_ast.

Limitations

There are still limitations.

  1. The current code only implements topological-order evaluation.
    That is, it builds the entire dependency-graph, topologically-sorts
    it, and then evaluates attributes. This is … suboptimal, when
    we well know that almost all interesting AGs are already in the
    class of ordered attribute-grammars (OAGs). I plan to implement
    the OAG evaluation strategy next.

  2. Traditionally AGs are defined over “productions” which are
    sequences of nonterminals and terminals. This doesn’t correspond
    to the way we define OCaml constructor data-types. So instead of a constructor like

type expr =
  ... | Call of name * arg_list
and arg_list = NoArgs | SomeArgs of expr * arg_list

we might want to use 'a list

type expr =
  ... | Call of name * expr list

Problem is: defining attribute-equations for (effectively) an array of
nodes, is not part of the standard lingo of AGs. But I believe we can
invent new syntax and make this succinct.

  1. Storage optimization. A naive implementation of AGs can store all
    attributes ever computed, at all the nodes in the AST. This can
    use a lot of memory. But there are well-known techniques to
    discard attributes once they’ll never more be needed in the rest of
    the attribute-evaluation, and I plan to implement these techniques.

There’s an entire literature on things like remote-references in
attribute grammars, aggregates, and other things, all of which can
probably be usefully employed.

Conclusion

I think that attribute-grammars could be a useful way to structure
complex multipass program-analysis, just as they used to do back in
the good ol’ days.

Maybe worth a look-see!

4 Likes