Best approach to implementing custom backends for OCaml in 2020

I’m hacking on an OCaml backend targeting Go; it’s currently in the PoC phase and unpublished. So far I’ve decided to build on top of https://github.com/jordwalke/rehp by @jordwalke. Rehp is a fork of js_of_ocaml that adds a layer of generalized intermediate representation between generate.ml and the actual backend producing code in the target language. In upstream js_of_ocaml this is hard-wired to produce a JS-specific tree right from generate.ml. As far as I know, @hhugo was not opposed to upstreaming some of the work in Rehp into js_of_ocaml.

As of now Rehp has gone a bit stale: it’s still based on jsoo 3.6 while 3.8 is already out. Also, the fact that bytecode is untyped causes some pain when targeting a statically typed language, so I wanted to share some thoughts on what’s happening in the OCaml backend landscape and get some feedback/ideas from the community, to understand which direction is best nowadays for alternative backend authors.

Go is flexible enough that one can write fully dynamic code like in JavaScript, and that’s how I managed to get my backend to support what Rehp emits. But performance is abysmal: basically, I have to box everything, including integers, so that I have one single dynamic type (interface{}) representing any OCaml value. Unfortunately I can’t use the same tagging trick for integers and pointers that OCaml uses in the native target; the Go GC is not happy about that, and crashes when it sees a pointer which is actually an integer.
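To make the boxing cost concrete, here is a minimal sketch of this representation (my own illustration, not actual output of my backend): every OCaml value hides behind interface{}, and blocks become slices of boxed values, so even integer addition pays for type assertions.

```go
package main

import "fmt"

// value is the single dynamic type standing in for any OCaml value:
// ints, blocks, closures, strings all hide behind one interface.
type value = interface{}

// block models an OCaml heap block: a tag plus boxed fields.
type block struct {
	tag    int
	fields []value
}

// camlAdd is what even integer addition looks like once everything
// is boxed: two type assertions and a re-box on every operation.
func camlAdd(a, b value) value {
	return a.(int) + b.(int)
}

func main() {
	// The pair (1, 2) as a tag-0 block of boxed ints.
	pair := block{tag: 0, fields: []value{1, 2}}
	fmt.Println(camlAdd(pair.fields[0], pair.fields[1])) // 3
}
```

Every arithmetic operation going through assertions like this, instead of staying on unboxed machine integers, is exactly where the slowdown comes from.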

I tried to type the bytecode, at least partially, and read some papers on gradual type systems, but so far I have not achieved any particular success (partly because of lack of time, partly because I’m not really proficient in the subject and had no idea what I was doing). I’m not sure this is solvable at all given only OCaml bytecode, as OCaml’s memory representation is Lisp-style and highly dynamic in nature.

I know @EduardoRFS also tried to type the bytecode, but IIRC with additional type annotations embedded in the bytecode, probably coming from the typedtree? Given the necessity of FFI with the target language, some additional annotations would probably have to be passed from the source representation down to the bytecode, to specify concrete target-language types for abstract types used at the OCaml level (i.e. I have a type t in OCaml with some raw macro-style bindings for that type, but in Go I want it to be *bytes.Buffer, so that I don’t have to wrap it in a dynamic value container and do type assertions at runtime).

On the other hand, @ostera seems to be working with compiler-libs and the typedtree directly in Caramel (an Erlang backend). The original paper on js_of_ocaml suggests the typedtree is fragile while bytecode is stable, and that it’s better to build on bytecode for a project to be sustainable. Is that still true nowadays? Or is the typedtree stable enough to rely on?

There are obviously more radical choices, like forking the OCaml compiler altogether, as BuckleScript does, but that looks too involved to be feasible.

It would be awesome to hear back from compiler hackers :slight_smile:


To answer the typedtree part: the typedtree is completely unstable, its representation is highly non-trivial, and the compiler team is not particularly happy with it. Moreover, it is very easy to make mistakes when manipulating it.


Another side effect of using the typedtree is that you’re not exactly compliant with the OCaml memory model, so you cannot hack things with Obj.magic and expect them to work. That means not supporting a couple of libraries, and it mostly implies a fork of the ecosystem, like what ReScript did.

That’s why I tried to type the bytecode without using the typings from OCaml to type it: the types from OCaml were used only to choose which representation was preferred, but the idea was to support both, encoding records as arrays and as objects in JS. If a function expecting an object was called with an array, I added a fallback (getter and setter), and if it’s in the same compilation unit I could in theory generate a specialized version. That should only happen when Obj is used, so performance should be quite stable.
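Translated into the Go setting of this thread (my own hypothetical sketch, not the actual JS implementation being described), the dual-representation idea might look like this: records compile to concrete structs where typing succeeds, with a generic positional accessor as the fallback for values that arrive through an untyped, Obj-style path.

```go
package main

import "fmt"

// record is the generic fallback interface: field access by index,
// mirroring OCaml's positional block layout.
type record interface {
	Field(i int) interface{}
}

// point is the specialized representation chosen when typing succeeds.
type point struct{ x, y int }

// Field is the fallback accessor, so typed structs still work when
// reached through an untyped (Obj-like) code path.
func (p point) Field(i int) interface{} {
	if i == 0 {
		return p.x
	}
	return p.y
}

// arr is the untyped array representation.
type arr []interface{}

func (a arr) Field(i int) interface{} { return a[i] }

// sum is compiled against the generic interface, so it accepts
// either representation.
func sum(r record) int {
	return r.Field(0).(int) + r.Field(1).(int)
}

func main() {
	fmt.Println(sum(point{3, 4})) // 7, specialized representation
	fmt.Println(sum(arr{3, 4}))   // 7, fallback representation
}
```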

The problem is that I have almost zero knowledge of type theory, and simple HM seems not to be enough (though I may be wrong); also, I didn’t have enough time for it. Now I’m learning about formal verification and type theory, so maybe I can try it again in the future.

This is a good question.

I concur with @octachron, and I think BuckleScript’s tendency to lag further and further behind OCaml versions confirms that it’s really not practical to try to use the internal, unstable representation of the compiler. You will get a compiler off the ground reasonably quickly, but maintaining it will be a huge pain. We will see how @ostera fares :stuck_out_tongue:

Regarding your questions:

  1. There is a lot of typing information in the debug info, especially for bytecode, which might be useful to you.
  2. You need a way to compile polymorphic functions; you will not get around that. Traditionally, there are two ways: A) make your runtime support it (OCaml, JS, …), or B) specialize. If option A is not possible in Go, then you need option B, probably with at least an integer version and a “tagged object” version. Specialization doesn’t mix terribly well with separate compilation, but for a bytecode approach, that’s almost a non-issue.
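As an illustration of option B (my own sketch, not from any existing backend): a polymorphic function could be emitted twice, once taking the tagged/boxed representation and once specialized to unboxed ints.

```go
package main

import "fmt"

// Generic boxed version: works for any OCaml value, at the cost of
// boxing through interface{} and an indirect comparison.
func maxBoxed(a, b interface{}, less func(x, y interface{}) bool) interface{} {
	if less(a, b) {
		return b
	}
	return a
}

// Specialized integer version: no boxing, comparable to hand-written Go.
func maxInt(a, b int) int {
	if a < b {
		return b
	}
	return a
}

func main() {
	// Fast path, usable when the compiler knows the types.
	fmt.Println(maxInt(1, 2)) // 2
	// Fallback path for truly polymorphic call sites.
	r := maxBoxed(1, 2, func(x, y interface{}) bool { return x.(int) < y.(int) })
	fmt.Println(r) // 2
}
```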

Thanks to everyone who answered! We have consensus that the typedtree is a no-go :+1:

simple HM seems to not be enough but I may be wrong

I’ve started implementing some kind of Huet-style unification: I put all arrow types, struct types, and primitive types in a graph, and added equality-constraint edges. If I solve only direct constraints (i.e. those not coming from arrow unification), I don’t see any conflicts and everything looks good. When I try adding constraints coming from arrow unification, things get worse, as some of the functions are polymorphic.
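For reference, the core of such a unifier, solving equality constraints over variable, primitive, and arrow nodes with union-find, might look like this (a toy sketch of the general technique, not my actual implementation):

```go
package main

import "fmt"

// typ is a node in the type graph: a variable, a primitive,
// or an arrow (function) type.
type typ struct {
	name     string // non-empty for primitives, e.g. "int"
	dom, cod *typ   // both set for arrow types
	link     *typ   // union-find parent once unified
}

func tvar() *typ           { return &typ{} }
func prim(n string) *typ   { return &typ{name: n} }
func arrow(d, c *typ) *typ { return &typ{dom: d, cod: c} }

// find follows union-find links to the representative node.
func find(t *typ) *typ {
	for t.link != nil {
		t = t.link
	}
	return t
}

// unify solves one equality constraint, reporting a conflict
// (e.g. int = string) as an error. Occurs check omitted for brevity.
func unify(a, b *typ) error {
	a, b = find(a), find(b)
	if a == b {
		return nil
	}
	switch {
	case a.name == "" && a.dom == nil: // a is a variable
		a.link = b
		return nil
	case b.name == "" && b.dom == nil: // b is a variable
		b.link = a
		return nil
	case a.name != "" && a.name == b.name: // same primitive
		return nil
	case a.dom != nil && b.dom != nil: // arrow = arrow: unify pointwise
		if err := unify(a.dom, b.dom); err != nil {
			return err
		}
		return unify(a.cod, b.cod)
	}
	return fmt.Errorf("conflict between incompatible types")
}

func main() {
	// Constraining 'a -> 'a against int -> ? forces ? = int.
	av, res := tvar(), tvar()
	err := unify(arrow(av, av), arrow(prim("int"), res))
	fmt.Println(err, find(res).name) // <nil> int
}
```

It is exactly the polymorphic case that breaks this simple scheme: a function used at both int and string will make `unify` report a conflict, which is where generalization (or the specialization discussed below in the thread) has to come in.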

I haven’t looked at unification of structs, so I have no idea whether it’s possible to completely infer struct shapes. It probably won’t work when a struct is partially accessed in some function which it reaches via some polymorphic function, but maybe I’m just wrong about this.

You need a way to compile polymorphic function. You will not get around that. traditionally, there are two ways: A) Make your runtime support it (OCaml, JS, …) B) Specialize. If option A is not possible in Go, then you need to use option B, probably at least with an integer version and a “tagged object” version.

Both options are possible in Go. Right now I have a working polymorphic version; it’s just slow. My plan was to type whatever can be typed without conflicts, and figure out which functions have conflicting types while resolving equality constraints for functions: if they do have conflicts on some parameter, treat it as polymorphic and emit a cast from the concrete value to the universal box at the call site. Does that sound sane?
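In terms of generated Go, that plan would come out roughly like this (hypothetical output I sketched by hand; the function names are invented):

```go
package main

import "fmt"

// show ends up with conflicting uses (called with both int and
// string), so the backend keeps it polymorphic: its parameter
// stays in the universal box.
func show(v interface{}) string {
	return fmt.Sprintf("%v", v)
}

// double was typed without conflicts, so it gets an unboxed signature.
func double(n int) int { return 2 * n }

func main() {
	n := double(21) // unboxed all the way
	// At call sites of the polymorphic function, the compiler emits
	// a cast from the concrete value into the universal box.
	fmt.Println(show(n))       // int boxed at the call site
	fmt.Println(show("hello")) // string boxed at the call site
}
```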

Specialization is not terribly good with separate compilation, but for a bytecode-approach, that’s almost a non-issue.

Separate compilation is a good goal: compilation times for a giant blob of Go code are intimidating, and I want to support separate compilation in the distant future. The power of type inference will probably decrease with separate compilation, though :thinking:

There is a lot of typing stuff in the debug info, especially for bytecode, which might be useful to you

Sounds promising. @Drup, can you please give some pointers on where to look for this information? js_of_ocaml probably doesn’t expose it out of the box.


I’ve groomed the project a bit so that it can be shown to the public; here it is: https://github.com/Lupus/ocaml2go


I googled around for debug info in bytecode and stumbled upon a Stack Overflow question about mapping bytecode to source locations, which gives some pointers on where to look.

It turns out js_of_ocaml is already reading debug events. I’ve patched Rehp a bit:

diff --git a/compiler/lib/ocaml_compiler.ml b/compiler/lib/ocaml_compiler.ml
--- a/compiler/lib/ocaml_compiler.ml
+++ b/compiler/lib/ocaml_compiler.ml
@@ -54,6 +54,9 @@ let rec find_loc_in_summary ident' = function
   | Env.Env_empty -> None
   | Env.Env_value (_summary, ident, description)
     when Poly.(ident = ident') ->
+    Format.printf "ident: %s, value_description: %a\n\n"
+      (Ident.name ident)
+      (Printtyp.value_description ident) description;
     Some description.Types.val_loc
   | Env.Env_value (summary,_,_)
   | Env.Env_type (summary, _, _)

…and the output looks pretty promising (here’s a small piece of it):

ident: float_of_string_opt, value_description:
val float_of_string_opt : string -> float option

ident: string_of_float, value_description:
val string_of_float : float -> string

ident: valid_float_lexem, value_description:
val valid_float_lexem : string -> string

ident: int_of_string_opt, value_description:
val int_of_string_opt : string -> int option

ident: string_of_int, value_description:
val string_of_int : int -> string

ident: bool_of_string_opt, value_description:
val bool_of_string_opt : string -> bool option

@EduardoRFS have you considered using these debug events?

@Drup, are these debug events guaranteed to be precise and complete? Or are they some kind of best effort to help debuggers, such that relying on this information for compilation is a bad idea?

I’ve looked at the type info for a reference:

let store = ref None ;;

And the type in the debug event is still polymorphic, while Merlin shows the inferred type perfectly fine in my VS Code… Yet in some other cases (like an arrow with obvious argument types) the types are concrete. Are debug events emitted before all of the type inference machinery has run to completion?

Yes I did, and it seems like a bad idea for doing codegen: debug events aren’t part of the compile pipeline and can be unreliable, and they also don’t reflect the runtime properties.

They can be used as hints, but using them for codegen seems risky to me.