How to "rewind" when traversing a syntax tree?

Hi!

I have a situation when traversing my AST where I need to “rewind” and output something in the previous match case.

Case in point:

local p = new Point {1, 2};

should generate the following C-code:

Point __p = {1, 2};
Point *p = &p;

But the AST is:

Assignment (
    Struct_typ (Local, "Point"),
    "p",
    New (
        "Point",
        [
            (("x", Int), Num 1);
            (("y", Int), Num 2);
        ]
    )
);

so when matching on it, the “new” is discovered “too late”. Further more, the line can be seen as syntactic sugar for

let p = new local Point{1, 2};

I’m on the fence where to put the locality kind here… :thinking:

Possible solutions:

  • Scan syntax tree twice to insert needed lines
  • Don’t make “compile_tree_to_c_code” return a string, but instead use it to put lines into a queue which can be manipulated.

Any ideas? Is there an off-the-shelf solution?

Olle

(Full code: escape-c/ast.ml at master · olleharstedt/escape-c · GitHub)

While it might be far from an optimal solution, you can pattern match on nested constructors, and use a wildcard for others, like so:

match term with
| Assignment (ty, id, New (struct_name, struct_init)) -> (* do something *)
| Assignment (ty, id, expr) -> (* do another thing *)

This lets you handle the specific case up-front instead of backtracking, but it can be hard to scale to a lot of cases. I’m still learning about compilers myself so I’m not totally sure what the best solution is but I hope this is a good stopgap!

That doesn’t scale well. Imagine the following line:

local points = [new Point {1, 2}, new Point {2, 3}];

which should evaluate to:

Point __p = {1, 2};
Point __p2 = {2, 3};
Point* points = [&__p, &__p2];

So each time the traversal stumbles upon a “new”, it has to know which implied locality to apply. So a state in the traversal…? Or there needs to be another traversal first to insert the correct locality into the tree.

Yeah it sounds like there’ll need to be some statefulness (not necessarily mutable though) - a map with bindings seems common when doing this sort of traversal in a type-checker at least (I still haven’t gotten around to codegen in my project, so I’m not all that familiar).

Perhaps your recursive function could return both the string of code, and a map of bindings it made (or something similar)? That way your leaf nodes can tell their parents that there’s a value to do stuff with.

I know this is super vague but i’m still figuring this stuff out myself :sweat_smile:

Oh yeah, returning something more than a string might be an interesting take… :thinking:

You can probably, just add an additional argument to you function which holds a list of visited nodes (so that you can go back, if there is need).

Also, zippers. But usually, that’s not very ergonomic

Or maybe put the strings into a tree that is rendered breadth first? Might give more control over the order.

[I could be wildly mistaken here, but] This smells like some sort of “give names to subexpressions and then translate” problem? That is to say, you want to first do a sort of flattening of the complex expressions, so that each expression is a simple function-call, with arguments that are themselves only values or variables? Felleisen & Sabry call this “A-Translation”, and this general idea of “name nontrivial subexpressions and flatten them so that they are sequentially laid-out” is often useful as part of converting to some simpler language. You could implement this by iteratively rewriting your program, e.g.

local t = new Tree { new Tree { new Tree { new Leaf { 1}, new Leaf {2} }, 
                                                  new Leaf { 3} } ,
                                new Leaf { 4 } }

The outermost rewrite gives

local t1 = new Tree { new Tree { new Leaf { 1}, new Leaf {2} }, new Leaf { 3} }
local t2 = new Leaf { 4 }
local t = new Tree { t1 , t2 }

and then the next rewrite start at the earliest term that was modified, giving

local t3 = new Tree { new Leaf { 1}, new Leaf {2} }
local t4 = new Leaf { 3}
local t1 = new Tree { t3, t4 }
local t2 = new Leaf { 4 }
local t = new Tree { t1 , t2 }

and so on. The idea that your language should be capable of naming all intermediate expressions (or you’ll need to add that capability), and then you do all the rewriting in-language, only converting to C after you’ve got things in a sufficiently explicit form.

2 Likes

Did some testing, and feel like a second pass would be easiest. This is a toy example for a syntax tree with only two nodes: assignment and stack_alloc:

type stmt = Assignment | Stack_alloc

let rec insert_stack_alloc (stmts : stmt list) = match stmts with
    | [] -> []
    (** TODO: Needs guard to check for New expressions, and a function to
              fetch all those expression *)
    | Assignment::tail -> Stack_alloc :: Assignment :: insert_stack_alloc tail
    | Stack_alloc::_ -> failwith "Shouldn't visit Stack_alloc in this pass"

Instead of rewinding just do not emit prematurely :slight_smile: The common technique, used in AST folding, is maintaining a stack for keeping a partially ready example and emitting the code once it is ready. The stack, basically, is a reverse polish notation (RPN) of your currently visited expression. This is a common technique used in uplifting from lower to higher-level languages, e.g., in decompilers and other lifters.

2 Likes

Cool, thanks! Didn’t know. A stack makes sense.