A PPX Rewriter approach to ocaml-migrate-parsetree

TL;DR

Based on camlp5 and the pa_ppx PPX rewriters, I’ve written a new one, pa_deriving_plugins.rewrite, that automates almost all the work of writing a migration from one version of OCaml’s AST to another.

  1. It took a few days (b/c of laziness) to write the initial PPX rewriter
  2. A day to get 4.02->4.03 AST migration working
  3. a couple of hours to get 4.03->4.02 working
  4. and a few more hours to get 4.03<->4.04 and 4.04<->4.05 working

At this point, I fully expect that the other version-pairs will not be difficult.

You can find this code [warning: very much a work-in-progress] at https://github.com/chetmurthy/pa_ppx/tree/migrate-parsetree-hacking

The file pa_deriving.plugins/pa_deriving_rewrite.ml contains the source for the PPX rewriter.

The directory pa_omp contains the migrations, typically named rewrite_NNN_MMM.ml.

A slightly longer-winded explanation

If you think about it, ppx_deriving.map isn’t so different from what
we need for ocaml-migrate-parsetree. ppx_deriving.map, from a type definition for 'a t, will automatically generate a function

map_t : ('a -> 'b) -> 'a t -> 'b t

If you think about it, if we could just substitute our own type for the second occurrence of t (somehow … yeah grin) then it would be almost like what we want for o-m-p, yes?

With 11 versions of the Ocaml AST so far, maybe it’s worth thinking about how to automate more of the migration task. Also, since so much of it is type-structure-driven, one would think that it would be an excellent opportunity to apply PPX rewriting technology. Indeed, one might think that a good test of PPX rewriting, is the ability to automate precisely such tasks.

So what’s hard about this migration task? Here are some issues (maybe there are more):

  1. the types are slightly differently-organized in different versions of the AST. Types might move from one module to another.
  2. sometimes new types are introduced and old ones disappear
  3. constructor data-types may get new branches, or lose them
  4. record-types may get new fields, or lose them
  5. sometimes the analogous types in two consecutive versions are just really, really different [but this is rare]: we need to supply the code directly
  6. when mapping from one version to another, sometimes features are simply not mappable, and an error needs to be raised; that error ought to contain an indication of where in the source that offending feature was found
  7. And finally, when all else fails, we might need to hack on the migration code directly

But morally, the task is really straightforward (with problems listed in-line):

  1. use ppx_import to copy over types from each of the AST times of each Ocaml version

    • ppx_import works on .cmi files, and those have different formats in different versions of Ocaml. Wouldn’t it be nice if it worked on .mli files, whose syntax (b/c OCaml is well-managed) doesn’t change much?
  2. build a single OCaml module that has all the AST types in it (from all the versions of OCaml)

  • but without the form
type t = M.t = A of .. | B of ....
that is, without the "type equation" that allows for a new type-definition to precisely repeat a previous one.  
  1. Then use ppx_import against this single module to construct a recursive type-declaration list of all the AST types for a particular version of OCaml, and apply a “souped-up” version of ppx_deriving.map to it, to map the types to another version of the AST types.

– but ppx_deriving.map doesn’t do this today, and besides, it would have to provide a bunch of “escape hatches” for all the special-cases I mentioned above.

But this is in principle doable, and it has the nice feature that all the tedious boilerplate is mechanically-generated from type-definitions, hence likely to not contain errors (assuming the PPX rewriter isn’t buggy).

So I decided to do it, and this little post is a result.

Discussion

I think this is a quite viable approach to writing ocaml-migrate-parsetree, and I would encourage the PPX community to consider it. One of the nice things about this approach, is that it relies heavily on PPX rewriting itself, to get the job done. I think one of the important things we’ve learned in programming languages research, is that our tools need to be largely sufficient to allow us to comfortably implement those same tools. It’s a good test of the PPX infrastructure, to see if you can take tedious tasks and automate them away.

I’m not going to describe anymore of how this works, b/c I’d rather
get the rest of the migrations working, start figuring out how to
test, and get this code integrated with camlp5.

But for anybody who’s interested, I’d be happy to interactively
describe the code and walk them thru how it works.

4 Likes

For a person who hasn’t digged into OMP, can you explain how it is different from what is done currently? Because the idea I had of OMP is basically what you describe, a set of functions transformation an AST from vX to vX-1 and vX+1. So I am obviously missing something.

Yes, you’re right: imagine a series of modules M2…M11. Each declares the same set of types, but with different definitions, yes? Then you’d have migration modules, migrate_i_j (j=i+1 or j=i-1) that have functions that convert between the analogously-named types. The entire question is: how are these functions implemented? By hand? With significant mechanized support? They can’t be implemented fully-mechanically, because there are decisions to be made about how to bridge differences in type-definitions. For instance, look at the 4.02 type label and the 4.03 type arg_label. Sometimes these are analogous (and sometimes they’re not). When they’re analogous, the code that converts between -cannot- be automatically inferred: a human has to write it. But -most- of the code of these migration functions can be inferred automatically from the type-definitions themselves.

And that’s really all that my little experiment does: automatically infer the migration code (most of the time) with some hints for those cases where it’s not possible to automatically infer.

Now, why would one do this? Well, two reasons:

  1. it should be more maintainable to automatically generate most of the code from types, and it should be quicker to bring online a migration for a new version of the Ocaml AST.
  2. this should be a good test of PPX rewriting. That is, if we’re going to build a macro-preprocessing support system, shouldn’t it be able to make solving such straightforward, but very tedious, problems much, much easier?
1 Like

I forgot to add a third reason why this PPX-rewriter-based approach is better:

  1. If you look at ocaml-migrate-parsetree “migrations”, you’ll find that they’re almost all boilerplate code. But sprinkled here-and-there, is actual logic, actual decisions about how to come up with values for new fields, about which fields, when non-trivial (e.g. not “[]”) should lead to migration-failure, etc. It is this code, that is the actual meat of the migration, and it’s not at all obvious, when sprinkled thru the mass of mechanically-produclble boilerplate.

A mechanized production of that boilerplate would mean that we retained explicitly only this nontrivial code, and hence for maintenance we could focus on it, and make sure it does the right thing.

Figuring out ways to make maintaining this stuff more efficient would be great! One aspect that isn’t clear to me is how this approach compares to the process currently used to generate the omp code. I haven’t done it myself, but at first glance the tools to generate the omp code (e.g. gencopy) seem to also accurately be describable as heavily using ppx infrastructure in order to implement the map code from one version to another. Is there an executive summary that compares and contrasts that and this proposal?

From the README, gencopy is used to generate a prototype file for each migration, and then a human goes in and fixes up the code. A way to put my point is: gencopy should be provided the fixups in some compact form, and apply them itself.

OK, I finished writing the migrations for the OCaml ASTs, so decided that it might be fun to write a little example to demonstrate how this works.

The ASTs

Suppose that I have two AST types (in ex_ast.ml):

module AST1 = struct
type t0 = string
type t1 = A of t1 * int list
type t2 = B of string * t0 | C of bool | D
type 'a pt3 = { it : 'a ; extra : int ; dropped_field: string }
type t4 = t2 pt3
end

module AST2 = struct
type t0 = int
type t1 = A of t1 * int list
type t2 = B of string * t0 | C of int | E
type 'a pt3 = { it : 'a ; extra : int ; new_field : int }
type t4 = t2 pt3
end

These AST types differ in the following ways:

  1. t0 is just completely different in these versions
  2. t1 is identical, but references a type that is different in each version; also an externally-defined polymorphic type-constructor ( 'a list)
  3. t2 differs in each version: it loses a branch (D), and gains a branch (E). It also has a branch whose arguments change type (C)
  4. 'a pt3 similarly loses a field and gains a field; it’s also a polymorphic type
  5. t4 is identical, but again references types that are different in each version.

The Migrations

We’d like to write migrations that incur the -least- amount of boilerplate. What would that mean? For starters, we need to match up the types: there’s no way to avoid that. Next, for branches that change type, we need to provide code. We need a way to flag branches that appear or disappear, and similarly for fields, to compute the value of fields that appear. So we’ll:

  1. list the types we’re going to work with, importing their definitions, so that the migration code will work against those already-defined types
  2. specify pairs of source-type, destination-type patterns, and for each, perhaps supply the function, or code for new/disappearing fields/branches
  3. for polymorphic types, we need to specify how their type-parameters get rewritten.

Below, we’ll describe the process for taking this information and automatically computing the code of these migrations.


module Migrate_AST1_AST2 = struct

module SRC = Ex_ast.AST1
module DST = Ex_ast.AST2

exception Migration_error of string

let migration_error feature =
  raise (Migration_error feature)

let _migrate_list subrw0 __dt__ l =
  List.map (subrw0 __dt__) l

type t0 = [%import: Ex_ast.AST1.t0]
and t1 = [%import: Ex_ast.AST1.t1]
and t2 = [%import: Ex_ast.AST1.t2]
and 'a pt3 = [%import: 'a Ex_ast.AST1.pt3]
and t4 = [%import: Ex_ast.AST1.t4]
[@@deriving migrate
    { dispatch_type = dispatch_table_t
    ; dispatch_table_value = dt
    ; dispatchers = {
        migrate_list = {
          srctype = [%typ: 'a list]
        ; dsttype = [%typ: 'b list]
        ; code = _migrate_list
        ; subs = [ ([%typ: 'a], [%typ: 'b]) ]
        }
      ; migrate_t0 = {
          srctype = [%typ: t0]
        ; dsttype = [%typ: DST.t0]
        ; code = fun __dt__ s ->
            match int_of_string s with
              n -> n
            | exception Failure _ -> migration_error "t0"
        }
      ; migrate_t1 = {
          srctype = [%typ: t1]
        ; dsttype = [%typ: DST.t1]
        }
      ; migrate_t2 = {
          srctype = [%typ: t2]
        ; dsttype = [%typ: DST.t2]
        ; custom_branches_code = function
              C true -> C 1
            | C false -> C 0
            | D -> migration_error "t2:D"
        }
      ; migrate_pt3 = {
          srctype = [%typ: 'a pt3]
        ; dsttype = [%typ: 'b DST.pt3]
        ; subs = [ ([%typ: 'a], [%typ: 'b]) ]
        ; skip_fields = [ dropped_field ]
        ; custom_fields_code = {
            new_field = extra
          }
        }
      ; migrate_t4 = {
          srctype = [%typ: t4]
        ; dsttype = [%typ: DST.t4]
        }
      }
    }
]
end

module Migrate_AST2_AST1 = struct

module SRC = Ex_ast.AST2
module DST = Ex_ast.AST1

exception Migration_error of string

let migration_error feature =
  raise (Migration_error feature)

let _migrate_list subrw0 __dt__ l =
  List.map (subrw0 __dt__) l

type t0 = [%import: Ex_ast.AST2.t0]
and t1 = [%import: Ex_ast.AST2.t1]
and t2 = [%import: Ex_ast.AST2.t2]
and 'a pt3 = [%import: 'a Ex_ast.AST2.pt3]
and t4 = [%import: Ex_ast.AST2.t4]
[@@deriving migrate
    { dispatch_type = dispatch_table_t
    ; dispatch_table_value = dt
    ; dispatchers = {
        migrate_list = {
          srctype = [%typ: 'a list]
        ; dsttype = [%typ: 'b list]
        ; code = _migrate_list
        ; subs = [ ([%typ: 'a], [%typ: 'b]) ]
        }
      ; migrate_t0 = {
          srctype = [%typ: t0]
        ; dsttype = [%typ: DST.t0]
        ; code = fun __dt__ n -> string_of_int n
        }
      ; migrate_t1 = {
          srctype = [%typ: t1]
        ; dsttype = [%typ: DST.t1]
        }
      ; migrate_t2 = {
          srctype = [%typ: t2]
        ; dsttype = [%typ: DST.t2]
        ; custom_branches_code = function
              C 1 -> C true
            | C 0 -> C false
            | C _ -> migration_error "t2:C"
            | E -> migration_error "t2:E"
        }
      ; migrate_pt3 = {
          srctype = [%typ: 'a pt3]
        ; dsttype = [%typ: 'b DST.pt3]
        ; subs = [ ([%typ: 'a], [%typ: 'b]) ]
        ; skip_fields = [ new_field ]
        ; custom_fields_code = {
            dropped_field = string_of_int extra
          }
        }
      ; migrate_t4 = {
          srctype = [%typ: t4]
        ; dsttype = [%typ: DST.t4]
        }
      }
    }
]
end

Using the Migrations

And we can use them in the toplevel to migrate in each direction:

#load "ex_ast.cmo";;
#load "ex_migrate.cmo";;
open Ex_migrate ;;
open Ex_ast ;;
# Migrate_AST1_AST2.(dt.migrate_t4 dt AST1.{ it = C true ; extra = 3 ; dropped_field = "1" });;
- : Ex_migrate.Migrate_AST1_AST2.DST.t4 =
{Ex_migrate.Migrate_AST1_AST2.DST.it = Ex_migrate.Migrate_AST1_AST2.DST.C 1;
 extra = 3; new_field = 3}
# Migrate_AST2_AST1.(dt.migrate_t1 dt AST2.(A(1, [2;3])));;
- : Ex_migrate.Migrate_AST2_AST1.DST.t1 =
Ex_migrate.Migrate_AST2_AST1.DST.A ("1", [2; 3])

How Does It Work?

The migration code is computed in a pretty straightforward manner. In the following, assume we’re migrating from AST1 to AST2 (DST is also the same as AST2). Let’s call a “migration function” something of type

type ('a, 'b) migrater_t = dispatch_table_t -> 'a -> 'b

where dispatch_table_t will be defined later. First, assume (recursively) that each each migration rule yields a
migration function. So a rule like

      ; migrate_t1 = {
          srctype = [%typ: t1]
        ; dsttype = [%typ: DST.t1]
        }

will yield a function of type

(AST1.t1, AST2.t1) migrater_t

and a rule like

      ; migrate_pt3 = {
          srctype = [%typ: 'a pt3]
        ; dsttype = [%typ: 'b DST.pt3]
        ; subs = [ ([%typ: 'a], [%typ: 'b]) ]
        ; skip_fields = [ dropped_field ]
        ; custom_fields_code = {
            new_field = extra
          }
        }

will yield

'a 'b. ('a, 'b) migrater_t -> ('a AST1.pt3, 'b AST2.pt3) migrater_t

Notice that this is a function that takes a migration from 'a to 'b, and produces one from 'a AST1.pt3 to 'a AST2.pt3. So from a source-type, we need to mechanically compute the code, and also compute the result-type. This will be key idea: the migration process, applied to a source-type, will produce both the code that migrates that source-type, and also the destination-type.

Now, consider any rewrite rule, and let’s argue by cases on how its code & destination-type can be computed.

  1. suppose that the source type (the field srctype) of the rule can be head-reduced (by applying some type-definition as an abbreviation) to an algebraic sum-type (A of ... | B of ...) or record-type ({a: t1 ; b : ... }). Then we can generate code to pattern-match on values of the source-type, and for variables in the generated patterns, we know their types. We can then apply (via pattern-matching over types) all the rewrite-rules to compute code that will migrate the variables’ values. Since the recursive application of migration-rules produce destination-types, we can substitute those into branches/fields, to get the destination-type. [More on pattern-matching below.]

    • some of the branches of a sum-type might have their code specified in a custom_branches_code, and those branches supersede what would have been automatically-generated.

    • some of the fields of a record-type are listed in skip_fields, and they’re not generated; some of the fields are listed with code to compute them (based on the fields in the pattern above) in custom_fields_code, and those get added.

  2. suppose that the source type after a single head-reduction yields a type-expression that can be successfully pattern-matched by one of the rewrite-rules other than this rule we’re currently processing; then we can use that rewriter function to migrate the value of the source-type, and again, we get the destination-type.

    • Why other than the current rule? It’s simple: we could end up in an infinite recursion if care isn’t taken to write the migration rules correctly.
  3. The process of pattern-matching takes as input a type-expression, and a migration-rule. If the source-type of the migration-rule matches the type-expression (in the usual sense), then it produces bindings for the type-variables.

Let’s look at an example. Suppose we want to compute the migration function for the type-expression t4.

  1. head-reducing once yields t2 pt3.

  2. the rule migrate_pt3 matches (source-type 'a pt3), with type-variable 'a bound to t2.

    • also, (via the subs field) if the type bound to 'a is rewritten to 'b then the destination-type is 'b AST2.pt3 [remember this below]
  3. the type-expression t2 matches the rule migrate_t2, with no type-variables bound and destination-type AST2.t2.

  4. So migrate_t2 : (AST1.t2, AST2.t2) migrater_t (that is, the destination type is AST2.t2)

  5. Thus in #2 above, 'b is bound to AST2.t2, and hence the destination-type in #2 above is AST2.t2 AST2.pt3

So the whole migration code for type t4 is migrate_pt3 migrate_t2 (which is what we would expect).

[EDITED TO ADD this section on the dispatch table]

The “Dispatch Table”

Throughout this, I haven’t explained what the “dispatch table” is for.
In everything above, we could have generated all the migration
functions in a big let-rec. But that would force us to assume that
the generated code is already correct. What if the generated code is
wrong, and we need to fix it somehow? Should we edit the generated
code? If so, how maintainable is that? What if we need a special
version of the migration from AST1 to AST2, that does something
slightly different than the standard one?

The dispatch table is a way of solving this problem. Suppose that
(for whatever reason) we need to modify the code of one of the
migration functions. The dispatch table has type

type nonrec dispatch_table_t = {
  migrate_list :
    'a 'b.
      ('a, 'b) migrater_t ->
      ('a list, 'b list) migrater_t;
  migrate_t0 : (AST1.t0, AST2.t0) migrater_t;
  migrate_t1 : (AST1.t1, AST2.t1) migrater_t;
  migrate_t2 : (AST1.t2, AST2.t2) migrater_t;
  migrate_pt3 :
    'a 'b.
      ('a, 'b) migrater_t ->
      ('a AST1.pt3, 'b AST2.pt3)
      migrater_t;
  migrate_t4 : (AST1.t4, DST.t4) migrater_t;
}

If we need to modify (say) the code of migrate_t2, we can do so with the code

{ dt with migrate_t2 = .... code using dt.migrate_t2 as a fallback .... }

and because all recursive calls go thru the dispatch table, this new version of migrate_t2 will be used everywhere.

2 Likes

It seems like this declarative approach ought to make it possible to combine multiple migration stages at generation-time and effectively deforest the intermediate AST. I’ll have to do some experimentation …