Is there any documentation specifying the “.cmo” file format? I have been searching and have found very little documentation on how the OCaml VM and bytecode work. There are some rather old but hopefully still valid docs at cadmium: http://cadmium.x9c.fr/distrib/caml-instructions.pdf (the ISA of the VM).
But for the specification of the “.cmo” format I have found nothing. The closest I got was peeking into the source code of the compiler and tools, but that’s not a great way to learn it.
magic number (Config.cmo_magic_number)
absolute offset of compilation unit descriptor
block of relocatable bytecode
debugging information if any
compilation unit descriptor
where the compilation unit descriptor is a marshaled value of type Cmo_format.compilation_unit.
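As a rough illustration of the layout above, here is a minimal sketch that reads only the fixed-size header using the standard library. The helper name read_cmo_header is hypothetical, and the sketch assumes the magic number is 12 bytes long (as in the mainstream "Caml1999O..." strings) and that the offset is stored as a 4-byte big-endian integer, which is what input_binary_int reads.

```ocaml
(* Assumption: the magic number occupies the first 12 bytes of the file. *)
let magic_length = 12

(* Hypothetical helper: returns (magic, offset of the compilation unit
   descriptor) without interpreting anything else in the file. *)
let read_cmo_header (filename : string) : string * int =
  let ic = open_in_bin filename in
  match
    let magic = really_input_string ic magic_length in
    (* input_binary_int reads a 4-byte big-endian integer *)
    let offset = input_binary_int ic in
    (magic, offset)
  with
  | header -> close_in ic; header
  | exception e -> close_in_noerr ic; raise e
```

Printing the pair for a real file shows which compiler produced it and where the marshaled descriptor starts; everything between the header and that offset is bytecode plus optional debugging information.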
Thank you Ivan, this is a great starting point (this is the source I was talking about). But still I’m having a hard time disassembling a “.cmo” file. I have made a simple hello world example to play with: https://github.com/ImanHosseini/CMODisas
Looking into the “.cmo” file in HexEditor:
We see a magic number (weirdly “OCP-1999O009”, and I don’t know where it comes from) and then the offset of the compilation unit descriptor, as you said. After it we expect a block of bytecode, but I guess we are not getting it? I don’t know what that 0x35 is; it can’t be a valid opcode.
As per the cadmium links in CMODisas, there is a document with the bytecode specs and how the data types work. Something that is missing is, for example, how the code is actually laid out in memory: I’d imagine that for each instruction the opcode is 1 byte, followed by its operands if there are any, but nowhere does the doc say how many bytes each operand takes or whether there are any alignment constraints.
Also, the document on the bytecode file format speaks of sections named “CODE”/“DATA”/… which do not exist here. [In the generated “.exe” file these exist at the very end of the file.]
The “OCP-1999O009” magic number corresponds to OCamlPro’s version of the compiler (more specifically, 4.02.1+ocp1). There is no guarantee that it is compatible with mainstream bytecode, though in practice if you find a tool that works for mainstream 4.02.1 it should work here as well. I’m curious about how you came upon such a file and why you’re trying to disassemble it, though.
You might want to add OCaml bytecode support to the radare2 reverse engineering framework, so you will have more features helping you disassemble it effectively.
See the Plugins - Disassembly and Plugins - Analysis radare2 book chapters. Also feel free to drop me a private message if you have questions. And of course, it has OCaml bindings: https://opam.ocaml.org/packages/radare2/
The offset after the magic number points to the compilation unit descriptor, which is written in the OCaml Marshal format (parseable only with the OCaml input_value function). Once you have read it, you can look at its cu_pos and cu_codesize fields, which will point you to the bytecode, e.g.,
(* needs compiler-libs: cmo_magic_number is in Config, compilation_unit in Cmo_format *)
let ic = open_in_bin filename in
let len_magic_number = String.length cmo_magic_number in
let magic_number = really_input_string ic len_magic_number in
if magic_number = cmo_magic_number then begin
  (* file offset of the marshaled compilation unit descriptor *)
  let cu_offset = input_binary_int ic in
  seek_in ic cu_offset;
  let {cu_pos; cu_codesize; _} = (input_value ic : compilation_unit) in
  close_in ic;
  (* do the disassembling on the [cu_pos, cu_pos + cu_codesize) region *)
end
Also, there is already an OCaml disassembler, see the tools/dumpobj.ml file. It will read the code and even print the bytecode.
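If you only want a quick look at the raw words before reaching for dumpobj, something like the following might help. It is a sketch under the assumption that the code region is a sequence of 32-bit little-endian words, which is how dumpobj.ml appears to read it (four bytes at a time, low byte first). The names word_at and words_of_code are made up for illustration, the values are returned unsigned (signed operands would need sign extension), and a 64-bit OCaml is assumed so that a full 32-bit value fits in an int.

```ocaml
(* Decode the unsigned 32-bit little-endian word at byte offset [i].
   Assumption: 64-bit OCaml, so the result fits in an int. *)
let word_at (code : string) (i : int) : int =
  let byte k = Char.code code.[i + k] in
  byte 0 lor (byte 1 lsl 8) lor (byte 2 lsl 16) lor (byte 3 lsl 24)

(* Split a code region into its 32-bit words; trailing bytes that do not
   form a whole word are ignored. *)
let words_of_code (code : string) : int list =
  let n = String.length code / 4 in
  let rec go i acc =
    if i < 0 then acc else go (i - 1) (word_at code (i * 4) :: acc)
  in
  go (n - 1) []
```

Here code would be the cu_codesize bytes read from position cu_pos, for example with seek_in followed by really_input_string while the channel is still open.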