The future of ppx

Dear all,

If you are enjoying the many benefits of ppx, you probably know that it comes at a very high price: every OCaml release can potentially break your code. If you are the author of a ppx rewriter, it is likely that you often need to upgrade your code and re-release it.

This has happened several times since the ppx system has landed, and we recently decided that it was time to solve this problem once and for all. In this post, I will discuss the current state of affairs, why we want to change it and how we plan to do it.

This initiative is pushed by Jane Street as we are a big consumer of ppx rewriters. It will be executed by the ppxlib team, currently formed of myself, @xclerc and @NathanReb, as well as Carl Eastlund who is joining us for this particular effort.

The current state of the ppx world

The current ppx world is composed of various components. I am quickly describing the ones I know in this section. The order doesn’t correspond to the order in which these components where developed.

The -ppx option of the compiler

ocamlc and ocamlopt both take a -ppx command line option. This option takes as argument a program that is executed during the compilation of a file in order to transform it on the fly. This program is called a ppx rewriter or ppx for short. More precisely, once the OCaml compiler has parsed the source file and constructed an in-memory representation of its structure, often called an Abstract Syntax Tree or AST for short, it runs the ppx with this AST as input. The ppx returns a new transformed AST and the compiler continues the compilation process with this new AST, discarding the original one.

Several -ppx options can be passed to the compiler. In this case, the compiler will apply the various ppx rewriters one by one, each one feeding its output to the next one.

When this option was introduced, the language was also augmented with extension points and attributes, giving ppx rewriters hooks to embed foreign DSLs in OCaml source files for their own purposes.

Ppx rewriters are typically OCaml programs that use the internal compiler libraries to analyse and transform the AST. Most often, they simply expand a few specific extension points and/or interpret a few attributes.

The -ppx option, a few modules of the compiler libraries, extension points and attributes form the basis of the ppx system. These were developed and integrated in the OCaml compiler mostly by @alainfrisch several years ago. The original motivation for this work was to provide a technically simpler replacement of Camlp4 as well as enforce a more uniform syntax of the language. Camlp4 was the previous official meta-programming system for OCaml.

ppx_tools

ppx_tools is the original toolbox for authors of ppx rewriters. It is composed of a library of helpers and a couple of tools. It was originally developed by @alainfrisch.

ocaml-migrate-parsetree

The compiler libraries are unstable and often change in incompatible ways. This includes the definition of the AST. ocaml-migrate-parsetree is a library that exposes the AST definition of each major version of the compiler as a separate module, as well as migration functions to convert between the various versions. A ppx rewriter can then choose one single version of AST to work with and ocaml-migrate-parsetree will do the necessary conversions to allow the ppx rewriter to be used with a different version of the compiler.

In addition, it also provides a small driver functionality, which allows to link several ppx rewriters into a single executable in order to use a single -ppx option of the compiler rather than several ones. This allows ocaml-migrate-parsetree to perform the minimum number of AST conversions in order to speed up the overall process.

Finally, ocaml-migrate-parsetree snapshots not only the AST but also the few modules from the compiler libraries that form the basis of the ppx system. This was done to ease the port of existing ppx rewriters to ocaml-migrate-parsetree, however we have now come to regret this choice has it makes it difficult to support new versions of the compiler.

ocaml-migrate-parsetree was initially developed by @let-def, and I myself joined the project in its early days as I was eager to use it for the Jane Street suite of ppx rewriters.

ppx_tools_versioned

ppx_tools_versioned extends ocaml-migrate-parsetree to ppx_tools. More precisely, ppx_tools_versioned is a package that contains one full copy of ppx_tools for each version of the AST snapshoted by ocaml-migrate-parsetree. This allowed ppx rewriters using ppx_tools to be easily ported to ocaml-migrate-parsetree.

ppx_tools_versioned was created by @let-def.

ppx_deriving

ppx_deriving is a ppx rewriter that allows to automatically derive code from type definitions. A list of derivers can be attached to a type definition via a [@@deriving] attribute. ppx_deriving provides a few derivers and third party projects can implement their own derivers. Each deriver must register itself against the ppx_deriving library. For this reason, the various derivers must be linked inside the same executable. To this purpose, ppx_deriving offers a driver functionality. This driver supports both static and dynamic linking of the various plugins.

ppx_deriving predates ocaml-migrate-parsetree, however nowadays the driver part of ppx_deriving is using the ocaml-migrate-parsetree driver as backend so that ppx_deriving, deriving plugins and other ppx rewriters can be linked as part of the same ppx driver. Apart from that, ppx_deriving is still based on the current version of the OCaml AST, meaning that every new OCaml releases can potentially break it.

ppx_deriving was developed by @whitequark in the early days of ppx. Nowadays @gasche is the main maintainer of ppx_deriving.

ppxlib

ppxlib is a comprehensive library that exposes a higher level abstraction for authors of ppx rewriters. More precisely, ppx rewriters are no longer seen as blackboxes that transform the full AST and must be applied one by one. Instead, extension points are seen as compile time functions that are evaluated in a top-down manner. Not only this leads to much better performances as the whole rewriting is always done in a single pass, but it also provides a much better model for authors and users of ppx rewriters. In particular, it is much easier to reason about how several ppx rewriters compose with each other. Without ppxlib, it is up to the final user to understand the low-level details of the various ppx rewriters in order to understand whether they can be used simultaneously and how.

Ppxlib also provides safety guarantees by checking that all attributes are interpreted, ensuring that typying mistakes are caught instead of being silently ignored. This was in fact the original motivation for the development of ppx_core, the ancestor of ppxlib.

Ppxlib exposes a Ppxlib.Deriving module providing the same functionality as ppx_deriving. A small common dependency called ppx_derivers ensures that deriving plugins based on either ppxlib or ppx_deriving can be used simultaneously.

It also offers a driver functionality which is built on top of the ocaml-migrate-parsetree one to ensure maximum inter-operability. The library itself is based on one selected version of the AST exposed by ocaml-migrate-parsetree. This way, when a new OCaml compiler is released ppxlib and ppx rewriters based on ppxlib usually continue to work as before. However, when ppxlib bumps the version of the AST it is based on, all clients of ppxlib can potentially break and need to be upgraded and re-released.

ppxlib is the result of a merge between several older ppx projects. These projects were developed at Jane Street and started during the port of our code base from Camlp4 to ppx that was performed by myself and Nick Chapman. I am the original authors of a lot of the architecture and code of ppxlib, although some of the code of Ppxlib.Deriving is much older than this and dates back from the Camlp4 days.

dune

dune is not strictly part of the ppx world. However, it orchestrates their compositions by linking static ppx drivers on demand. Dune doesn’t support arbitraty ppx rewriters, only the ones that can be linked together as part of the same driver. Additionally, when doing so all the ppx rewriters must be based on the same driver backend.

Nowadays, the vast majority of ppx rewriters are based on the ocaml-migrate-parsetree driver in one way or another.

Why is ppx so painful?

The main reason why ppx rewriters are so much pain is because the system is based on the compiler libraries. The compiler libraries are meant for experts and provide no stability guarantee. With such an unstable basis, it is no surprise that the whole system keeps breaking all the time.

ocaml-migrate-parsetree helped the situation by allowing to sandbox individual ppx rewriters into a protective layer. However, this sandboxing means that this method is not applicable when ppx rewriters need to inter-operate with each other in more sophisticated ways such as with ppx_deriving or ppxlib. Moreover, a user of ppx rewriters cannot use new language features until all the ppx rewriters it uses are based on the new version of the AST. Which means that all ppx rewriters still need to be updated and re-released after a new release of OCaml.

Finally, there are just too many projects doing the same thing which makes everything really confusing.

What’s the plan?

The plan is to provide a stable base for the whole system that doesn’t break when a new compiler is released or require all ppx rewriters to be re-released. Because there are some complicated problems that cannot be solved without breaking API change, this new base will be released as new package that will be called simply “ppx”. Although a large part of it will be imported from ppxlib.

We will ensure that one way or another, ppx rewriters based on ppx can be used in conjunction with ppx rewriters based on ppxlib, ocaml-migrate-parsetree, ppx_deriving, ppx_tools, … This will provide a smooth transition story from the old to the new world. I discussed with the authors of the various projects to make sure they are happy with the idea of this new project eventually replacing everything else, to make sure we are not creating “yet another standard”. I also discussed with the other OCaml core developers to establish a strong link between “ppx” and the compier and make sure that the compiler will never break “ppx”. In particular, it will become much easier to test the trunk of OCaml against all released opam packages.

I am hoping that the stability guarantee provided by this new base will be enough of an incentive for authors of ppx rewriters to switch to it. However, if you have any concern about this plan, please raise them here or to me privately as soon as possible.

What does it mean for the AST?

The main difficulty of this project is to design a stable representation of the OCaml AST. What I mean by a stable AST is the following: a given file will always parse to exactly the same value no matter the version of the compiler. If this property is true, then one can have good confidence that an AST transformation written now will continue to be valid for a long time.

This is currently not true as the types used to represent the AST keep changing in breaking ways. For this reason, ppx rewriters will no longer see the AST used inside the compiler. Instead, they will work on a different AST that is more loose and allows to represent more than just the current language. In particular, this AST should be able to represent any future version of the language. However, the use of private types and construction functions will ensure that ppx rewriters can only construct valid AST fragments.

We are not far enough into this project to know the final representation of the AST. However, to illustrate the idea, here are two examples of what such an AST could look like (I omitted the locations to keep the examples simple):

(* Example 1: plain s-expressions *)

type t =
  | Atom of string
  | List of t list

type structure = private t
type expression = private t
...

(* Representation of [let x = 1 in x + 1]:

   {[
     List [Atom "let";
           List [Atom "binding";
                 List [Atom "ident"; Atom "x"];
                 List [Atom "int"; Atom "1"]];
           List [Atom "apply";
                 Atom "+";
                 List [List [Atom "ident"; Atom "x"];
                       List [Atom "int"; Atom "1"]]]]
   ]}
*)

(* Example 2: adding slightly more structure *)

type t =
  | Int of int
  | String of string
  | Ident of string
  | Let of t list * t
  | Binding of { lhs : t; rhs : t }
  | Apply of t * t list
  | ...

type structure = private t
type expression = private t
...

(* Representation of [let x = 1 in x + 1]:

   {[
     Let ([Binding { lhs = Ident "x"; rhs = Int "1" }],
           Apply (Ident "+", [Ident "x"; Int "1"]))
   ]}
*)

It is easy to see that such ASTs can easily be extended without breaking backward compatibility. Constructions functions would ensure that all values of type structure or expression that can be produced are valid AST fragments, i.e. ones that are part of the OCaml language.

For pattern matching, we will provide view patterns based on the ideas used in ppx_view so that programmers don’t accidently write non-sensical patterns, i.e. patterns that can never match anything because they match on values that cannot be produced by the parser. Another way is by testing coverage: if one can reach 100% coverage then this is a proof that all the patterns are valid.

Using an AST that is not the one of the compiler might seem contrary to the philosophy of ppx. However, @alainfrisch mentioned to me that he did envision that ppx rewriters would use a different more stable AST when designing the original ppx feature. So I would say that we are making ppx what it was meant to be rather than diverging from it.

Conclusion

Ppx has a long and storied history leading to a complex stack that is difficult to maintain. We now want to clean it all and restart fresh with a strong base. While doing so, we are opening the discussion about this work early, and most importantly before the point of no return. So I definitely encourage anyone who is interested by all this or will be affected by these changes to chime in, ask for precisions, challenge the technical decisions and raise any concern so that together we can build a better, stronger and more unified ppx ecosystem!

Additionally, all this work will be done entirely in a pure open source fashion, which will make it easy for everyone to follow and/or contribute. In particular, help is most definitely welcome :slight_smile: Ppx rewriters are used by a lot of people, so this work will benefit a large part of the OCaml and even Reason communities. So if you are new to OCaml and and motivated to make an impact, then this is definitely a project to consider.

56 Likes

It reminds me a bit of the situation around LLVM/Clang-based tools in MLIR presentation. See pages 7-8, mentioning SIL (Swift AST-like intermediate language) and non-existent CIL (missing part of the Clang/LLVM toolchain). While there is a question about AST rather than IL, I think the similar logic applies. Like at some point Rust introduced medium-level IL, it might be helpful to think about this proposal as of “medium-level AST”. Or at least I think so (and MLIR authors as well - see slide 43 of the same presentation).

3 Likes

Wouldn’t it make sense to use an extensible data type here, rather than a stringly-typed one? It seems reasonable to expose constructors for the current language constructs, with the guarantee that they won’t change and that old code that used to be encoded using them will still be so.

1 Like

Actually, you don’t need extensible variants, plain ones would do just as well. But yes, that’s an option to consider. The risk with this approach is that after a few iterations, the AST would look like a series of patches against an initial AST. In all cases, the tradeoff is the same: you have a datatype that can represent more than the set of values that can be produced by the parser and you rely on smart constructors to enforce that only valid values can exist.

Are you sure? The reason I proposed using an extensible variant here is to force users to always handle a default case. If it’s a plain variant, users may handle all currently-defined cases, and their code may break when one is added later, no?

True, but I think the stringly-typed representation fundamentally would have similar problems; they’d just be perhaps less severe but also less apparent due to the implicitness of the interface.
For example, hypothetically say OCaml adds qualifiers such as “protected” on type declarations. If a type t = ... declaration without any qualifier used to parse as List [Atom "type"; Atom "t"; ...], it would need to keep parsing as such. On the other hand, type declarations with a non-empty list of qalifiers would need to contain an additional list in their representation, so that their syntax wouldn’t be a generalization of the no-qualifiers case — therefore, we’d effectively have two distinct “shapes” of type declarations due to the historical patchwork of backward-compatible language evolution.

In release mode, for instance when installing via opam install .., this is usually just a warning. So you’d get the following behaviour when the AST is extended:

  • the ppx would continue to build, except with a warning
  • existing source files would continue to be pre-processed the same way
  • source files using the new language feature would trigger a runtime error in the ppx

However, while working on the ppx, i.e. in development mode the developer would get a hard error they have to fix.

BTW, what do you have in mind regarding “protected” type declarations? i.e. what would be the syntax? In any case, the s-expression representation is probably a bit too extreme. The encoding in Example 2 is probably more realistic.

Hi,

I understand that designing a clean AST is already a challenging task,
but I’d neverthess like to ask for even more!

I am very interested in developing source-to-source typed-directed
program transformations. Whereas a ppx transformer takes an untyped
AST and produces an untyped AST, I’d like to be able to write a
transformer that takes a typed AST and produced an untyped AST
(because producing types on-the-fly is a lot of unecessary work).

I believe that it would be worth thinking about whether it is
possible to define a single “optionally typed AST”, which could
be used both for the purpose of transforming untyped code and
typed code. There are at least two important benefits to having
a single AST: first, it avoid significant effort duplication;
second, it enables a type-directed transformer to return its argument
whenever no changes need to be applied.

If you think that it is viable to anticipate for support of optional
type information in your “will-remain-fixed-for-ever AST”, I would
be very interested in helping out before it’s too late…

A few remarks:

(1) In OCaml, the parsetree and the typetree are already quite close,
(see my post: What would it take to unify the parsetree with the typedtree?)
so I see no reason why it would be an issue to adding extra type
information to a parse tree.

(2) A number of existing ppx-transformer could take advantage of
extra information brought by the type system, e.g. disambiguation
of “Some (x,y)” as 1-argument constructor rather than a 2-argument
constructor.

(3) In terms of AST representation, I imagine something along the
lines of:

type exp = { exp_loc; exp_desc; exp_attribute;
exp_env: env option; exp_type : typ option }

or, equivalently,

type exp = { exp_loc; exp_desc; exp_attribute; exp_typing }
type exp_typing = Typ_unknown | Typ_known of { typ; env }

(4) Ideally, the toolchain would provide a way to refine an untyped
AST into a typed AST, so that one may iterate several typed-directed
transformation.

(5) It would be useful for the typed AST to include type variable
binding locations.
(see my post: Keeping track of where type variables are bound in the typed AST)

(6) As use case for type-directed program transformations, I have a
collection of high-level data structure transformations that I have
starting working on, but got stuck due to the need for types and fully
resolved names. For example, one transformation consists of eliminating
a record field when it is not used or can be deduced from the value of
the other fields from the record. Another transformation consists of
inlining a record into another one, even though the former has been
implemented in a separate library.

(7) Keep in mind that your future AST, whether or not it carries type,
will be useful not only for source-to-source transformations, but also
for source code analyses or translations to other languages. There are
many potential applications. One that I care for is translating typed
OCaml code into “characteristic formulae”, which enables formal proofs
of imperative OCaml code using Separation Logic in Coq. At the moment,
my tool is implemented as a fork of the OCaml compiler, but this is
obviously very unsatisfactory.

(8) I am personally not entirely convinced that it will work out to freeze the
AST forever, so even if it changes very rarely, it might still be a good idea
to include some kind of version number to enable coexistence of several
versions of the AST if it turns out to be absolutely necessary one day.

Thanks
Arthur

2 Likes

While S-expressions are an excellent choice for representing code (tree) and code transformations, as proved by various Lisp implementations, the particular Janestreet’s Sexplib’s representation of s-expressions itself might be not the most optimal one. There are a couple of problems with the following representation

    type sexp = 
      | Atom of string 
      | List of sexp list

First of all, it hardcodes the list and string types. Second, it uses an inductive list type inside of another inductive type, instead of unfolding the induction inside of the single type definition. This all prevents an essential feature of Lisp’s data representation - hash-consing. And this is the feature that we will need to make the transformation fast, especially since the new ppx representation will involve translations of the ast to ppx and the other way around.

The first thing that should be done to enable hashconsing is to unfold the list, e.g.,

  type sexp = 
    | Atom of string
    | Cons of {car: sexp; cdr: sexp}

The next step is to parametrize over the element type, which will generalize the representation so that we can reify hash-consing into concrete types (e.g., represent terms using integers),

  type 'a sexp = 
    | Atom of 'a
    | Cons of {car: 'a sexp; cdr: 'a sexp}

To alleviate the pain of using such representation as well as to ensure the correctness, it should be actually hidden from the end user by abstraction boundaries, e.g.,

type 'a sexp

val atom : 'a -> 'a sexp
val cons : 'a sexp -> 'a sexp -> 'a sexp
val case : 'a sexp -> 
    atom : ('a -> 'r) -> 
    cons : ('a sexp -> 'a sexp -> 'r) -> 'r

Hiding the representation under the abstraction will allow us later to change the underlying representation so that the technical dept of incorrect decisions won’t accumulate with time forcing us to initiate a yet another interface breaking change.

An interesting example of a possible representation of sexp data, is just a cons cell represented with one machine word. Provided that the total number of different kinds of branches in OCaml AST is not extremely huge, it should be possible to represent sexp as an Int63.t word with all trees and constants interned.

5 Likes

Maybe this is out of scope, but unscientifically, 99% of my use of ppx is only ever to derive printers, equality and comparison, etc.

Both Haskell and Rust have solved this problem invoking 0 amount of pain, and using deriving for these purposes is actually quite pleasant and gets out of the way, letting you move on with non boilerplate code, which is originally what they’re for.

On the other hand using OCamls ppx from a build ergonomic perspective is still pretty unpleasant. Just the other week I had the pleasure of doing this again for the first time in sometime in dune, and it was still really not great. Specifying it manually in whatever build system one uses, knowing the package name, the ppx compiler flag, dealing with obscure setup issues, these are all uninteresting minutae or esoterica that are inflicted on users, and it doesn’t have to be this way (in theory).

In the time it took me to setup ppx and get it working, just to print something while debugging a small program, I could have manually wrote the printer and be done.

So is there an opportunity here to make deriving more first class so one doesn’t have to deal with all the current beaurocracy just to auto gen some code to test if two things are equal or print them out, similar to how other languages have solved this problem?

I don’t know what this would involve, whether it would be too invasive with special casing in the compiler or whatever but I’m hoping there’s at least some middle ground between here (ppx) and Haskell’s deriving?

10 Likes

Hi,

While I understand why you’d want types to implement some transformations, I think your points are somewhat remote from the issue at hand.

Recently, it has become very hard, for the maintainers of the compiler, or people writing new language features, to test unreleased versions of the compiler. 4.08 for instance (which was branched in January and actually entered a feature freeze a couple of months before that), wasn’t really tested until beta3 a couple of weeks ago.

The main reason for that being that most of the code out there depends on the ppx rewriters in one way or another, and so couldn’t actually be built.

On the other hand, typed transformations would, by definition, happen after type-checking. So my intuition would be that they live out of the critical path. I can imagine situations where the build might depend on them even today, but that seems kind of crazy and also doesn’t match your examples.
Actually, what your examples make me think of is refactoring (similar to what the Rotor project is about), not preprocessing.
I can see how refactoring tools would suffer from the same issues as ppx rewriters do now, but:

  • these tools are currently mostly hypothetical (I guess that’s something you want to change!)
  • while it seems likely that a similar solution could be applied for refactoring tools, it’s unclear why we’d want to share the solution.

At this point I feel like the current proposal fix a very real problem, and will affect basically every user of OCaml. Whereas in your case the problem hasn’t really happened yet.
I’m not saying we shouldn’t think about it, simply that:

  • it feels to me that there are many other things to think about for your project, and that it should be handled separately, perhaps reusing some of the ideas presented here, perhaps not.
  • we definitely shouldn’t noticeably slow down fixing the existing problem.
1 Like

@charguer, type directed program transformations definitely seems cool, but I agree with @trefis here, it has to be kept outside the scope of this work otherwise we’ll never see the end of it. To put things into perspective, this work is about building a healthy ppx ecosystem and that’s what we should focus on.

@ivg, i feel like you might have a concrete proposal in mind, care to share it? :slight_smile:

@m4b, IIUC you are thinking of limiting the scope of ppx to a few predefined derivers implemented by the compiler itself? I feel like it wouldn’t be enough to completely replace the current use cases for ppx, meaning that we’d be stuck with the current world and all its issues. But more importantly, I don’t see such a feature being merged in the compiler.

1 Like

Well, it is not like a concrete proposal, it is just we’re basically tackling with the same problems in BAP, and we also end up using simple s-expressions to represent any structured AST, therefore I’m eager to share my experience. Basically, what I’m proposing/suggesting for the purposes of the ppx rewriting is also to use s-expressions, but in the interned form, e.g., instead of using the Sexplib.t representation use the following

  type tree = {
    uid : int;
    eid : int;
    car : tree;
    cdr : tree;
  }

  let rec nil = {
    uid = 0;
    eid = 0;
    car = nil;
    cdr = nil;
  }

In this representation,

  • eid - the equivalence class number, so that all trees which have the same equivalence number are structurally equal;
  • uid - the unique identifier of a tree, is basically a reification of a pointer, so that we can maintain external mapping from subtrees to such properties as program locations, types, etc
  • car and cdr - are the pointers to the left and right subtree correspondingly (of course a less traditional but a more conventional lhs, rhs names could be used, I chose those names just to highlight that this is still plain old Lisp data).

When a tree is built from an ast it is interned. You can use a state monad or just intern them in hash tables. We will stick to the latter. Suppose we have a simple model of the OCaml AST,

module Ast = struct
  type ident = string [@@deriving compare, hash, sexp]

  type t =
    | Var of ident
    | Int of int
    | Let of ident * t * t
    | App of ident * t list
  [@@deriving compare, hash, sexp]

  include Base.Comparable.Make(struct
      type nonrec t = t [@@deriving compare, sexp_of]
    end)
end

And use hash tables to intern ast,

  module Source = struct
    let last_uid = ref 0
    let last_eid = ref 0

    let trees = Hashtbl.create (module Ast)
    let exps =  Hashtbl.create (module Int)

    let intern ?(car=nil) ?(cdr=nil) exp =
      incr last_eid;
      incr last_uid;
      Hashtbl.add_exn exps ~key:!last_eid ~data:exp;
      {uid = !last_uid; eid = !last_eid; car; cdr}

    let unique t =
      incr last_uid;
      {t with uid = !last_uid}
  end

The construction of a tree from an expression is quite simple and could be easily implemented for arbitrary trees,

  let rec of_exp exp : tree =
    match Hashtbl.find Source.trees exp with
    | Some t -> refresh t
    | None ->
      let tree = match exp with
        | Var _ | Int _ | App (_,[]) -> Source.intern exp
        | Let (_,x,y) ->
          Source.intern ~car:(of_exp x) ~cdr:(of_exp y) exp
        | App (_,x :: xs) ->
          Source.intern ~car:(of_exp x) ~cdr:(of_exps xs) exp in
      Hashtbl.set Source.trees ~key:exp ~data:tree;
      tree
  and of_exps = function
    | [] -> nil
    | x :: xs -> Source.intern ~car:(of_exp x) ~cdr:(of_exps xs) x

where the refresh function assigns new identifiers for an already hashconsed expression (we would like to identify structurally equal expressions as different)

  let rec refresh t = match t with
    | {uid=0} -> t
    | {car; cdr} -> Source.unique {
        t with
        car = refresh car;
        cdr = refresh cdr
      }

The translation back is even simpler,

let to_exp tree = Hashtbl.find_exn Source.exps tree.eid

What this setup gives us is:

  1. O(1) comparison, we can compare trees for equality as simple as
    let compare_tree {eid=x} {eid=y} = x = y
  2. We can perform rewriting without actually touching the tree, by representing it as a mapping from the old uid to the new uid.
  3. We can represent patterns as trees and we can even embed holes into them for bindings, e.g., if we represent the hole as a tree with eid=0, then we can extend our matching comparison function so that eid=0 matches with any other eid and bind the matched subtree with the uid of the hole.

The two main drawback of this approach are:

  1. The tree construction is a little bit costly as we have to pay the cost of comparisons upfront. This is a common limitation of interning/hash-consing. The cost of comparison is amortized, so if you’re not going to perform lots of comparisons you may end up with the negative net cost.
  2. Tree is not a first class representation and extracting information from it is not very straightforward.

I believe that in the rewriters we actually (1) do a lot of comparisons and (2) not really interested in the content of the matched expression as long as it matches. Also, exposing too much information in the tree beyond its structure will make the representation fragile to changes of the underlying AST. Therefore this scheme should fit nicely. But of course, the representation shouldn’t be exposed to the end user and be well hidden under the abstraction boundary. The ppx library should provide a few functional constructors for patterns and trees.

Here is the gist with full code from this posting.

5 Likes

Thanks for sharing. Do you have an example of what a rewriting looks like in BAP?

Oh, it’s much more complicated, since it involves semantics :slight_smile: And we do not perform rewriting in the way that I’m suggesting (because we don’t really need it). In short, we represent data as tagged words which are basically pointers into our own runtime. The data representation is ascribed with semantics by interpreting it in terms of some typed module structure, called theory in our parlance. There could be many theories, the Core Theory is an example. That’s all is heavily influenced and inspired by the Tagless Final approach, i.e., have abstractions instead of concrete representations. And any representation that could be either/or reflected from this abstraction (i.e., interpreted) or reified (i.e., be the interpreter), could be used interchangeably. The reason why we need this, is that we need to reason about the semantic structure of a program, i.e., we need to understand the difference between add x y and sub x y and fadd x y. From the perspective of syntactic rewriting the semantics is not really important, what matters is the structure, which is captured by the tree, and syntactic equality, which is represented by the equality class. We do perform some syntactic rewriting in our Lisp macro system, but it is a much simpler and more constrained system than ppx, so it is just a couple of lines of code.

It might be interesting to look into how we represent our Lisp programs. We actually are using the same idea of indexed trees, with two indices - equality and identity. We’re translating the parse tree (which is represented with Parsexp.Cst.t) into indexed tree, using the same approach as in the proposal, modulo that we’re not using hash tables, but normal functional maps. We then use identity indices to easily map subtrees to their types as well as to their origins in the source code (which is not trivial since we have macros).

1 Like

No not limiting the scope; I’m suggesting descoping specific features, like eg deriving, and moving them into the language as a first class construct, similar to what Haskell and rust have done.

The separate work of fixing ppx as you suggest would be independent of this (and also highly valued, thank you!)

As I said the majority of my usecases and associated pain is just adding derives, which involves imho unnecessary build infrastructure, boilerplate, and satisfying the esoteric bureaucracy of it all - all when I just want to derive a debug printer, print, and then move on with my life.

Anyway just my two cents and experience with the whole thing, as you said probably unlikely derives will be added/fixed in this manner, but I can still dream.

5 Likes

My take on this would be to be able to write something like: (type int) or (type (string * int) list) and have the compiler replace this with a value of type, respectively: int Type.t and (string * int) list Type.t.

This requires patching the compiler and the stdlib as follows:

  • add the Type module to stdlib, with 'a Type.t being a GADT capable of representing all or most of OCaml’s types, including variants and records, and annotations can even be stored as well;
  • patch typing so that if T is a type expression evaluating to type t, then (type T) has type t Type.t;
  • patch code generation so that (type t) generates a value of type t Type.t.

If Type.t is capable enough this would allow us to write serializers and deserializers (JSON, XML, YAML, binary protocols, you name it), including command-line argument parsing from type definitions. One would also be able to write comparison functions but only for non-abstract types, which would admittedly be a little limiting.

If type t is not ground, (type t) can either have typing fail, or have code generation generate Abstract nodes that one cannot do anything with. Abstract types, classes and first-class modules could also fall under this case. The point is not to be able to do that with all types; but to solve the very practical problem of having to write too much boilerplate.

With this and all the new goodies like the monadic let operators that we are getting, I would personally mostly have no use for ppx anymore :slight_smile:

7 Likes

Hey Romain, how are things? :slight_smile:

That seems like a nice proposal, however we are trying to make ppx more robust here, not completely replace it with something else! Please start your own post if you want to discuss this further :stuck_out_tongue:

1 Like

I think also important aspect of the ecosystem is forgotten - documentation. It makes sense to update the the parts of RWO if required:

4 Likes

Absolutely! We were just talking about it with @NathanReb the other day. RWO will definitely be updated once the new APIs are ready.

Please take a look at the ppx_deriving port to ppxlib (which will allow easier port to ppx in the future): https://github.com/ocaml-ppx/ppx_deriving/pull/210
So it will be possible to proceed with the rest of ppx_deriving_* extensions

3 Likes