What would it take to unify the parsetree with the typedtree?

I suspect that I’m not the first one to raise this question, but as I could
not find the answer online, so I’ll ask it here:

What would be the major obstacles to using a single AST
to represent both the parsetree and the typedtree?

Why factorize?

Before elaborating on this question, let me explain why I care so much about it.
Improving the code factorization inside the compiler, as useful as it might
be to decrease the cost of maintaining and extending the codebase, is not
what motivates my question. What I care about is the ability to conveniently
write source-to-source program transformations that need to exploit information
about types, and need to be able to rely on fully-qualified paths for identifiers.

Given that I need type information, it’s natural to start from the typedtree.
Maintaining types during transformation is tremendous effort, plus it’s generally
completely useless, since I could re-typecheck the new code. I thus have the option
of rebuilding a typedtree with broken type information, or rebuilding a parsetree.
Rebuilding a parsetree from the typedtree is possible, but is a rather inconvenient
and awkward thing to do when the input is a typedtree. When in a branch I want
to leave the input program “as it was”, I just expect to return it unchanged.
Moreover, due to small differences between the typedtree and the parsetree, some
extra work is required for producing valid a parsetree (see typing/untypeast.ml).

In summary, it seems that the sensible way to go is to express my program
transformation as a function that takes as input a typedtree with valid types,
and returns a typedtree with dummy type information, and then call the
type-checker again. However, I’m lacking at the moment all the mappers and
ast_helpers on the typedtree data type to help me produce structure. These tools
are provided for the parsetree only. This raises the obvious question: why
wouldn’t the parser also produce a typedtree with dummy type information?
This would also help factorizing all the iterators/mappers/ast_helpers/printers
over the two ASTs.

How to factorize?

This brings us back to the main question that is the concern of this thread.
At a high level, the parsetree has become pretty much a strict subset of the
typedtree. Imo, it no longer makes much sense to live with such a large amount
of code duplication between the parsetree and the typedtree. Concretely,
what I have in mind is that the typedtree could be used for both, simply
using dummy values for fields of type “typ” and “env” and “path” etc. What
matters is to clearly mark, when manipulating an AST, whether its types are
assigned or missing. (For efficiency of multiple program transformations, one
could also imagine an AST with only part of its type information being invalid.)

Over the years, differences between the two AST have been dramatically reduced.
Investigating the files and looking at typing/untypeast.ml, I could spot several
differences, which I think could be handled either by adding a few constructors
in the typedtree without polluting it too much, or in a few cases by requiring
the parser to handle some trivial syntactic sugar. During type-checking:

  • fun gets encoded using function (not convinced this is for the better;
    it seems not very hard for the typechecker to preserve the distinction;
    preserving it would certainly help for the writing of program transformations.)
  • exp_construct has an inherent ambiguity that can only solved during typechecking
    (nevertheless the Texp_construct could be used by the parser, and the typechecker
    could later refine as Pexp_construct
  • exception patterns are cleanly separated from other patterns (parser could do it?)
  • order of arguments in applications involving optional arguments is unclear (?)

Possible migration steps

I do not underestimate the work involved in a major refactoring of the code
base, and the issues associated with code migration of third-party code.
Yet, this cost could be mitigated, by carrying out the migration in several steps:

  • Define the new (unified) AST, with its iterator, mapper, helper, and printer,
    so that it is ready to use for type-aware source-to-source program transformation.
    This is certainly the part that involves the most work. But it’s the part
    that’s unavoidable for being able to write type-aware transformations, which
    is something that I —and I bet others— will certainly end up programming.
    That said, it would be great that the definition of the “ideal” AST be
    the result of an open discusssion with developers of the compiler.
  • Define a conversion function from and to that new AST to the old parsetree,
    so as to allow backward compatibility with ppx transformations. Note that
    one direction would be pretty much exactly the code from untypeast.ml.
    The other direction could be defined with the help of the functions from the
    AST_helper.ml associated with the new AST.
  • Update the code of the typechecker to take as input values from the new AST.
    This involves only very minor and local changes, because the new AST would
    be quite close to the current typed AST.
  • Update the code of the parser to directly produce values in the new AST.
    This would also involve relatively minor changes: essentially, we need to
    replace the constructors from the old parsetree datatype with calls to
    the AST helper functions for the new AST.

At that point, the original parsetree datatype will only need to remain for
backward compatibility with existing ppx code.

I would really like the understand what are the obstacles to be expected
to get there.

The interaction between named arguments and partial application is also
a weird beast:

# let f ~a:x y = (x,y) ;;
val f : a:'a -> 'b -> 'a * 'b = <fun>
# let g = f 1 ;;
val g : a:'a -> 'a * int = <fun>
# g 2 ;;
- : int * int = (2, 1)
# f 1 2 ;;
- : int * int = (1, 2)
# (f 1) 2 ;;
- : int * int = (2, 1)

I think it needs typing information to be determined.

Best,

Alan

The obstacle you give regarding writing typed transformations (which boils down to "I have to use untypeast instead of ast_helpers") are the most minors one.

There are mappers over the typedtree (they’re not as polished as the parsetree ones, but that would not be so difficult to fix) and Untypeast is very reliable nowadays. I’ve even used janestreet’s traversal generators (it’s in ppxlib.traverse nowadays, I think) on the typedtree pretty successfully.

The real obstacle to writing typed transformation is that the typedtree is horrible to manipulate. It’s a data-structure optimized for type inference, it’s full of mutability, complicated things like levels, binders are implicits, there is no abstraction and the complete source language is right in your face. The only people that can use this are the ones that could already contribute to the typechecker’s code.

And your proposal is to put that huge complicated beast everywhere!!
If we want to make typed ppx easier (or, well, reasonably possible), I think we should go in the opposite direction and hide the typedtree in favor of a less overwhelming immutable data-structure with explicit markers that can be safely approached by users.

2 Likes

Ok, I think that I understand what you are trying to explain: we need an AST that carries type information but with a representation of types based on a clean, immutable data structure, moreover with explicit binders for type variables. (In fact, for my CFML tool I already had to hack a typed tree extended with such explicit type binders.) You’re right that having a clean data type for “types” independent of type inference would make it much easier for the end user to manipulate types.

So, let’s assume that we have such a clean AST, carrying “clean” types. Now, I can reformulate my question in an even more ambitious manner: can we have a single AST that is common to (1) the output of parsing, (2) the process of type inference, and (3) the construction of the clean typed AST usable for program transformations?

Indeed, I certainly don’t want to encourage OCaml developers to maintain 3 versions of the AST… —I was already complaining about having 2 distinct ones. I’m not sure about the technical details, but I suspect that it should be doable to factorize the three AST. The only thing that changes is the representation of type annotations present on every AST node and identifiers.

One possibility is that a type annotation be one of “Notype” or “Internal pointer used for type inference” or “Clean immutable representation of the type.”. It would be manageable I think to provide accessor functions to query the types in the desired form. Another possibility is to set up type inference so that it would build an AST whose type annotations are mutable fields that, at the end of the typechecking process, get replaced by the immutable representation of the type of the corresponding nodes.

There was quite a bit of work (by Alain Frisch, iirc) to make the typedtree and the parsetree very close, and keep all syntactic information. So, in theory, there is not much to do to achieve that “except” figuring out how to add one slot for annotations everywhere. That annotation could either be a polymorphic type, or an extensible sum type.

Unfortunately, I’m not sure it’s so easy to do that in practice. Typing annotations in the typedtree are not so regular. I guess we would have to try and see what doesn’t fit.

Could you not do this fairly easily by reusing Untypeast (as suggested in the other topic), which is written in open-recursion style?

@Drup: I believe that the work to add parsetree-relevant information to the typedtree was done by OCamlPro (in particular Tiphaine Turpin, Thomas Gazagnaire, and Fabrice Le Fessant) in the context of the Typerex project – better support for IDEs on top of OCaml.

Implementing a source-to-source transformation as a mapper extending Untypedast, i.e. as a function from a typedtree to a parsetree, is technically possible but really inconvient without an additional layer of ast_helpers.

Consider an ugly case from Untypedast, e.g.

   | Tpat_construct (lid, _, args) ->
        Ppat_construct (map_loc sub lid,
          (match args with
              [] -> None
            | [arg] -> Some (sub.pat sub arg)
            | args ->
                Some
                  (Pat.tuple ~loc
                     (List.map (sub.pat sub) args)
                  )
          ))

If I want to do some transformation on a tpat_construct, I’ll have to reproduce this logic.

To avoid duplication of code pattern, I could introduce a smart constructor for Ppat_construct that takes arguments “similar to” those of Tpat_construct.

   | Tpat_construct (lid, _, args) ->
        Helper.Pat.construct (map_loc sub lid) (List.map (sub.pat sub) args)

where

 module Helper = struct
    module Pat = struct
       let construct lid args =
           let p = match args with
            | [] -> None
            | [arg] -> Some (arg)
            | args -> Some (Pat.tuple args)
            in
            Ppat_construct (lid, p)
    end
 end

Arguably, this is how Untypedast.ml should be implemented in the first place, so I could have a pull request for this Helper module.

Then remains the point of @Drup, which is that the interface available for manipulating types in the typed-AST are probably more complex than it could be. But that’s another story…

There is an parsing/ast_helper module with Pat module with the following not-smart-enough function:

val construct: ?loc:loc -> ?attrs:attrs -> lid -> pattern option -> pattern

You should indeed feel free to send a PR to add

val construct_n: ?loc:loc -> ?attrs:attrs -> lid -> pattern list -> pattern

I see a couple other improvements (Texp_construct would benefit from the same helper, Texp_extension_constructor needs help, exception patterns can be improved, etc.) from looking at Untypeast this way, but nothing that would clearly require a fundamentally different approach.