How to compile from C to OCaml?

keywords: C, compilation, OCaml, AST
(it was not possible to use or create these keywords)

I’m facing the following situation:
I received an old C program that should be about 10-20 kLOC. I would like to implement it in OCaml and to improve it. But my knowledge of C language and tooling is very very basic.
Is there an easy and reliable solution to compile this C program to OCaml? At least, this should produce all signatures and enough fragments of C structures translated to OCaml expressions.

I imagine that compiling some C source should produce some AST. So, I also imagine that some translation/compilation to an OCaml AST should be possible (this is certainly the key question). Once landed in the OCaml world, I’m confident that I should get some OCaml expressions.

btw, I compile OCaml programs everyday to get some exe, but I’ve never played in the reverse direction or even inspected intermediary files.
What is the minimum set of commands required to produce an AST from a set of .ml files, and reversely to produce some .ml files from an AST file?
Thanks.

Typically, no. C is much lower-level than ML or any other high-level language.

Why not just figure out how to compile the C program? On what OS/hardware did it last run successfully? Do you have tests for it?

The C program compiles and runs on x86 and x86_64.
This program rarely crash, but it does sometimes, because it’s not certified. Then a patch is written.
Ideally, I believe that using a tool like frama C could be a solution for improving/securing it. But I’ve no experience at all with frama C.
As the architecture of the C program has to be redefined, maybe a solution could be to use unchanged/most reliable C components and use OCaml to implement higher level components (the same as when using OCaml + fortran for intensive computing).

Tests? I don’t have enough information yet about that. But there is certainly enough testing materials.

Difficult question (out of a concrete context):
do you have a rule of thumb to compare the speed of an OCaml program (optimized with imperative functions) and of a C program? -10% ? -50%? -90%?!
-50% would be acceptable for some applications.

Also: one often reads that the OCaml compiler produces native code which is fast, not far from C. As C is lower level than OCaml, how should one understand that statement? (What are we really talking about?..)

For years decades the Ocaml implementers have been very honest about performance comparisons. Uncharacteristically so, compared to implementers of other high-level languages (“Java is as fast as C++”). Ha! I suspect a decent rule of thumb is that well-written Ocaml will perform half the speed of well-written C. Of course, well-written C is much more expensive to produce, than well-written Ocaml … but then again, you already have your C program. So I’d say you should be lucky to get a 2x slowdown moving to Ocaml.

I’m an experienced C programmer, so this might not be helpful, but … you might also consider trying to compile with a good C++ compiler and use that to push thru as much code-cleanup as you can. Also, there are bunch of code-verification products out there which might be useful. When it comes to both compliers and verification, trying as many as possible is a good plan: I remember a certain database product used to compile with G++, Intel’s compiler, and IBM’s C++ compiler – because each of them caught different bugs. Solely for bug-finding.

If I were in a position where I wasn’t very C-knowledgeable, and MUST NOT modify the C code (or modify it as little as tolerable), I would not try to run it in the same process with a high-level-language runtime: an unknown C code might have lots of interesting pointer-errors[2], and gee, that would be so much fun when they blow up my high-level-language runtime, they really would. Instead, I would try to cut apart the C program into two parts (high-level and low-level) separated into two processes.[1] Here, an RPC system like Thrift might help (as would compiling with a C++ compiler). Then once I have the system running that way, I could re-implement the high-level stuff in our favorite language.

[1] of course, this plan depends on details of the performance of the C program, and whether it -can- be cut apart, etc.

[2] OTOH, if you have high confidence in the C code, why bother changing it? Just go with it as-is grin

Thanks for your detailed explanation.

Indeed, I feel that you are very wise about not mixing OCaml and C in the same process. Also, a RPC system (Thrift, etc.) is a good idea for separating high level from low level C stuff, before reimplementing high level stiff in OCaml.

However, this C program is old (which also means more patched and somehow more robust…) and there is quite no more support. And I’m not experienced in C.
This situation is more an opportunity to think about how to use OCaml to program from scratch at “lower level”.

Getting approximately -50% speed would be acceptable.

To go ahead, my main concern is now to extract some specification from the C program. Today, the specification of the program is the C code itself.

It would be useful to get quickly an overview of the C source code instead of reading a bunch of C source files.
Do you think there is a simple way to leverage existing tools to do that, and produce some textual map?
Ideally, getting the OCaml signatures from a/this C program would be great. even if they are low level signatures. I first need to understand the concepts involved in that program.

Totally naive (and crazy?) side question: what about isolating much more components from this C program, and not only high/low level stuff? Just want to know if it’s feasible, even if speed is reduced by 30 or 60%.

Not OCaml, but there is a transpiler from C to Rust - c2rust. Theoretically, something similar can be made for OCaml too, but there is simply not enough interest to do that.

For navigating C source code there are three options:

  • Understand - commercial software
  • Sourcetrail - open source software
  • I personally use ccls for more in-editor search and sources navigation

Sourcetrail looks great for exploring unknown and big C programs.
ccls is a language server, so it should provide all relevant information on/around each expression (hover, completion, jump to definition, etc.). I didn’t notice visual features like sourcetrail has. Is there any?

Let’s take the non representative/very simple example of the insertion_sort function shown (as of today) at https://c2rust.com/: the semantics could be handled in OCaml as well without any gap.

C insertion_function (copy)
void insertion_sort(int const n, int * const p) {

    for (int i = 1; i < n; i++) {
        int const tmp = p[i];
        int j = i;
        while (j > 0 && p[j-1] > tmp) {
                p[j] = p[j-1];
                j--;
        }
        p[j] = tmp;
    }
}

Do you have an idea of typical situations where a C expression could not be mapped to/translated into an OCaml expression?

btw, does it exist something like sourcetrail for OCaml? It would be helpful to handle an unknown OCaml program.

EDIT:
As an exercise, I could write this insertion_sort function in imperative OCaml:

let insertion_sort (p:'a array) : unit =
  for i = 1 to (Array.length p) - 1 do
    let tmp = ref p.(i) in
    let j = ref i in
      while !j > 0 && p.( !j - 1 ) > !tmp do 
        p.( !j ) <- p.( !j - 1 );
        decr j;
      done;
      p.( !j ) <- !tmp;
  done
(* This polymorphic function uses stdlib polymorphic comparison operator ; 
    instead, it should use a monomorphic comparison function as an additional argument *)

As a side remark: it’s readable and easy to maintain although once it’s written it’s just supposed to be used.
It’s type safe and it seems sound regarding bounds (empty array).
However, I don’t know how it could behave with very large arrays with elements of complex types for which a specific comparison operator would be provided (no formal guarantee).

Nevertheless, if I would have a bunch of functions like that in a big OCaml program, I would still need a way to have an overview of all functions/modules and how functions are called.