I suspect that I’m not the first one to raise this question, but as I could
not find the answer online, so I’ll ask it here:
What would be the major obstacles to using a single AST
to represent both the parsetree and the typedtree?
Why factorize?
Before elaborating on this question, let me explain why I care so much about it.
Improving the code factorization inside the compiler, as useful as it might
be to decrease the cost of maintaining and extending the codebase, is not
what motivates my question. What I care about is the ability to conveniently
write source-to-source program transformations that need to exploit information
about types, and need to be able to rely on fully-qualified paths for identifiers.
Given that I need type information, it’s natural to start from the typedtree.
Maintaining types during transformation is tremendous effort, plus it’s generally
completely useless, since I could re-typecheck the new code. I thus have the option
of rebuilding a typedtree with broken type information, or rebuilding a parsetree.
Rebuilding a parsetree from the typedtree is possible, but is a rather inconvenient
and awkward thing to do when the input is a typedtree. When in a branch I want
to leave the input program “as it was”, I just expect to return it unchanged.
Moreover, due to small differences between the typedtree and the parsetree, some
extra work is required for producing valid a parsetree (see typing/untypeast.ml).
In summary, it seems that the sensible way to go is to express my program
transformation as a function that takes as input a typedtree with valid types,
and returns a typedtree with dummy type information, and then call the
type-checker again. However, I’m lacking at the moment all the mappers and
ast_helpers on the typedtree data type to help me produce structure. These tools
are provided for the parsetree only. This raises the obvious question: why
wouldn’t the parser also produce a typedtree with dummy type information?
This would also help factorizing all the iterators/mappers/ast_helpers/printers
over the two ASTs.
How to factorize?
This brings us back to the main question that is the concern of this thread.
At a high level, the parsetree has become pretty much a strict subset of the
typedtree. Imo, it no longer makes much sense to live with such a large amount
of code duplication between the parsetree and the typedtree. Concretely,
what I have in mind is that the typedtree could be used for both, simply
using dummy values for fields of type “typ” and “env” and “path” etc. What
matters is to clearly mark, when manipulating an AST, whether its types are
assigned or missing. (For efficiency of multiple program transformations, one
could also imagine an AST with only part of its type information being invalid.)
Over the years, differences between the two AST have been dramatically reduced.
Investigating the files and looking at typing/untypeast.ml, I could spot several
differences, which I think could be handled either by adding a few constructors
in the typedtree without polluting it too much, or in a few cases by requiring
the parser to handle some trivial syntactic sugar. During type-checking:
- fun gets encoded using function (not convinced this is for the better;
it seems not very hard for the typechecker to preserve the distinction;
preserving it would certainly help for the writing of program transformations.) - exp_construct has an inherent ambiguity that can only solved during typechecking
(nevertheless the Texp_construct could be used by the parser, and the typechecker
could later refine as Pexp_construct - exception patterns are cleanly separated from other patterns (parser could do it?)
- order of arguments in applications involving optional arguments is unclear (?)
Possible migration steps
I do not underestimate the work involved in a major refactoring of the code
base, and the issues associated with code migration of third-party code.
Yet, this cost could be mitigated, by carrying out the migration in several steps:
- Define the new (unified) AST, with its iterator, mapper, helper, and printer,
so that it is ready to use for type-aware source-to-source program transformation.
This is certainly the part that involves the most work. But it’s the part
that’s unavoidable for being able to write type-aware transformations, which
is something that I —and I bet others— will certainly end up programming.
That said, it would be great that the definition of the “ideal” AST be
the result of an open discusssion with developers of the compiler. - Define a conversion function from and to that new AST to the old parsetree,
so as to allow backward compatibility with ppx transformations. Note that
one direction would be pretty much exactly the code from untypeast.ml.
The other direction could be defined with the help of the functions from the
AST_helper.ml associated with the new AST. - Update the code of the typechecker to take as input values from the new AST.
This involves only very minor and local changes, because the new AST would
be quite close to the current typed AST. - Update the code of the parser to directly produce values in the new AST.
This would also involve relatively minor changes: essentially, we need to
replace the constructors from the old parsetree datatype with calls to
the AST helper functions for the new AST.
At that point, the original parsetree datatype will only need to remain for
backward compatibility with existing ppx code.
I would really like the understand what are the obstacles to be expected
to get there.