".cmo" specification

Hi,

Is there any documentation specifying the “.cmo” file format? I have been searching and have found very little documentation on how the OCaml VM and bytecode work. There are some rather old but hopefully still valid docs at cadmium:
http://cadmium.x9c.fr/distrib/caml-instructions.pdf -> the ISA of the VM

But for the specification of “.cmo” I have found nothing. The closest I got was peeking into the source code of the compiler and tools, but that is not a great way to learn the format.

A “.cmo” file has the following layout:

     magic number (Config.cmo_magic_number)
     absolute offset of compilation unit descriptor
     block of relocatable bytecode
     debugging information if any
     compilation unit descriptor 

where the compilation unit descriptor is a marshaled value of type Cmo_format.compilation_unit.
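To make the first two fields concrete, here is a minimal sketch that reads them with the standard library only. It assumes the magic number is 12 bytes long (as it is for the mainstream compiler's "Caml1999O…" strings); `read_cmo_header` and `read_cmo_header_from_file` are names made up for this example, not functions from the compiler.

```ocaml
(* Sketch: read the fixed-size header of a .cmo file using only the stdlib.
   Assumption: the magic number is 12 bytes long, e.g. "Caml1999O" followed
   by a three-digit version. *)
let magic_length = 12

(* Hypothetical helper: returns (magic, descriptor_offset). *)
let read_cmo_header (ic : in_channel) : string * int =
  let magic = really_input_string ic magic_length in
  (* input_binary_int reads a 4-byte big-endian integer *)
  let descr_offset = input_binary_int ic in
  (magic, descr_offset)

(* Same, but opens and closes the file itself. *)
let read_cmo_header_from_file (filename : string) : string * int =
  let ic = open_in_bin filename in
  Fun.protect ~finally:(fun () -> close_in ic) (fun () -> read_cmo_header ic)
```

The returned offset is where the marshaled descriptor starts; everything between the header and that offset is bytecode plus optional debug information.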

The other file formats are described in the same folder: https://github.com/ocaml/ocaml/tree/trunk/file_formats

Thank you Ivan, this is a great starting point (this is the source I was talking about). But still I’m having a hard time disassembling a “.cmo” file. I have made a simple hello world example to play with: https://github.com/ImanHosseini/CMODisas
Looking into the “.cmo” file in a hex editor, we see a magic number (weirdly “OCP-1999O009”, which I don’t know the origin of) and then the offset of the compilation unit descriptor, as you said. After it we expect a block of bytecode, but I guess we are not getting it? I don’t know what that 0x35 is; it can’t be a valid opcode.
As per the cadmium links in CMODisas, there is a document with the bytecode specs and how the data types work. What is missing is, for example, how the code is actually laid out in memory: I’d imagine that for each instruction the opcode is 1 byte, and then any operands follow it, but nowhere does the doc say how many bytes each operand takes or whether there are any alignment constraints.
Also, the document on the bytecode file format speaks of sections named “CODE”/“DATA”/… which do not exist here. [In the generated “.exe” file these exist at the very end of the file.]

The “OCP-1999O009” magic number corresponds to OCamlPro’s version of the compiler (more specifically, 4.02.1+ocp1). There is no guarantee that it is compatible with mainstream bytecode, though in practice if you find a tool that works for mainstream 4.02.1 it should work here as well. I’m curious about how you came upon such a file and why you’re trying to disassemble it, though.

You might want to add OCaml bytecode support to the radare2 reverse engineering framework, so that you have more features helping you disassemble it effectively.
See the “Plugins - Disassembly” and “Plugins - Analysis” chapters of the radare2 book. Also feel free to drop me a private message if you have questions. And of course, it has OCaml bindings: https://opam.ocaml.org/packages/radare2/

The offset after the magic number points to the compilation unit descriptor, which is written in the OCaml Marshal format (parseable only with the OCaml input_value function). Once you have read it, you can look at its cu_pos and cu_codesize fields, which tell you where the bytecode is, e.g.,

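     (* Config and Cmo_format come from compiler-libs *)
     let ic = open_in_bin filename in
     let len_magic_number = String.length Config.cmo_magic_number in
     let magic_number = really_input_string ic len_magic_number in
     if magic_number = Config.cmo_magic_number then begin
       (* absolute offset of the marshaled compilation unit descriptor *)
       let descr_pos = input_binary_int ic in
       seek_in ic descr_pos;
       let {Cmo_format.cu_pos; cu_codesize; _} =
         (input_value ic : Cmo_format.compilation_unit) in
       (* the bytecode occupies the region [cu_pos, cu_pos + cu_codesize) *)
       seek_in ic cu_pos;
       let code = really_input_string ic cu_codesize in
       close_in ic;
       (* do the disassembling on code *)
       ignore code
     end else
       close_in ic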

Also, there is already an OCaml disassembler, see the tools/dumpobj.ml file. It will read the code and even print the bytecode.
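As far as I can tell from dumpobj.ml, the code region is not a stream of 1-byte opcodes: each opcode and each operand is read as one 32-bit little-endian word. Below is a rough sketch of that decoding step under this assumption; `word_at` and `words_of_code` are names invented for the example, and the sign extension assumes a 64-bit OCaml.

```ocaml
(* Sketch: view a raw bytecode region as 32-bit little-endian words.
   Assumption: opcodes and operands are each stored as one such word,
   which is how tools/dumpobj.ml appears to read them. *)

(* Decode the signed 32-bit little-endian word starting at byte [i]. *)
let word_at (code : string) (i : int) : int =
  let b k = Char.code code.[i + k] in
  let u = b 0 lor (b 1 lsl 8) lor (b 2 lsl 16) lor (b 3 lsl 24) in
  (* sign-extend from 32 bits; assumes a 64-bit OCaml int *)
  if u land 0x8000_0000 <> 0 then u - 0x1_0000_0000 else u

(* Split the whole region into words; the length must be a multiple of 4. *)
let words_of_code (code : string) : int list =
  if String.length code mod 4 <> 0 then
    invalid_arg "words_of_code: length is not a multiple of 4";
  List.init (String.length code / 4) (fun n -> word_at code (4 * n))
```

With this reading, the bytes 35 00 00 00 decode to the single word 53, which can then be looked up in the instruction list; telling opcodes from operands still requires knowing each instruction's arity.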
