I thought I would share some of my experiences having just completed some work in customising the compiler, this mainly seeks to give some advice/thoughts for your first point on understanding the compiler, as that’s all I feel qualified to answer.
A little background: I was customising the compiler to emit custom instructions for RISC-V - this meant having a fairly good grasp of the overall structure of the compiler pipeline (although not the nitty-gritty details of each transformation).
Understanding the OCaml pipeline:
The first thing I did, thanks to @dra27’s advice, was to think of a new keyword (for me it was
camel) which was like a unary operator for simply adding one to a given number (or 2 in the OCaml tagged-bit representation). In the end I pushed this through the compiler to the bytecode-level. It gave me a good understanding of the general structure of the front and middle-end as you pass from intermediate representation (IR) to IR. The marvels of static type-checking mean this is almost road-mapped for you as once you’ve added the new keyword lots of pattern-matches become non-exhaustive and your job is simply to add an extra-case (but to also understand what each is doing).
As part of my CS degree we did a Compiler Construction course (using OCaml) which constructs a very simple (mostly interpreted) language called Slang (https://github.com/Timothy-G-Griffin/cc_cl_cam_ac_uk/tree/master/slang). Looking at the source-code for this might be a more digestible way of understanding the tools like the parser and lexer (
.mll) than jumping in at the deep-end with the OCaml ones.
I think these posts about some of the OCaml internals are great to explain some of the design decisions in terms of data format (https://rwmj.wordpress.com/2009/08/04/ocaml-internals/) as well as the aforementioned Real World OCaml “The Compiler and Runtime System” section. The OCaml manual, which is very thorough, also has a section about interfacing with C which is somewhat useful for understanding the data representations and using some of the runtime code to make this work (https://caml.inria.fr/pub/docs/manual-ocaml/intfc.html).
The compiler itself has the option of printing different IRs (parsetree, typetree, lambda, clambda, cmm, mach, linear…) using the -d<IR_name> whenever you run ocamlopt. This is a great tool for (1) analysing the different IRs (how data is represented etc.) and (2) understanding what high-level abstractions are being removed from IR to IR.
As a small example compiling the following with
ocamlopt -dclambda -dcmm -c:
let a = 3
(seq (setfield_ptr(root-init) 0 (read_symbol camlTest) 3) 0a)
(data int 1792 global "camlTest" "camlTest": int 1)
(function camlTest__entry () (store val(root-init) "camlTest" 7) 1a)
There’s quite a lot going on here, but one thing you can note from this is the translation from the original
3 value to the OCaml
7 value (
3 << 1 + 1). This is a very small example, but outlines a general method for understanding how OCaml represents the source-code at different levels of abstraction.
asmcomp directory contains the code for generating assembly (the back-end of the compiler). I think writing small programs and passing the
-S compiler flag and analysing the generated assembly is incredibly useful, especially if you then try tracking back up the compiler pipeline to see where the assembly came from (one good example would be the jump tables used for larger pattern matches).
AFAIK, there’s no single document which describes everything. This is why playing with the compiler is so useful, but can be intimidating/hard to decide what small piece you want to understand first. What I’ve tried to show are a few ways of learning about individual parts so you can start building up the bigger picture which has worked for me so far.
These are just some of the things that helped me come to grips with the compiler, hopefully it can help you too.
- Adding a custom keyword that does something very simple.
- Writing a toy-language (say the Lambda Calculus) using OCaml tools really helped me understand the front-end of the compiler.
- Printing lots of programs using the different IRs (
ocamlopt -c -dclambda... myfile.ml)
- Outputting the generated assembly and inspecting that file before working back through the compiler to understand how your source-code became that assembly (
ocamlopt -S myfile.ml)