Interfacing C++ with OCaml

Hi, I am writing a simple OCaml binding for C++ code. My intention is to construct a list of C++ pointers on the C++ side and pass it to the OCaml code. But it seems the GC improperly reclaims live list elements that have not been visited yet: while iterating over the list in my OCaml code, the program crashes with a segmentation fault.

Here is my code snippet:

(* C++ *)
value clang_function_decl_get_params(value Param) {
  CAMLparam1(Param);
  CAMLlocal3(Hd, Tl, Tmp);
  clang::FunctionDecl *P = *((clang::FunctionDecl **)Data_abstract_val(Param));
  Tl = Val_int(0);
  for (unsigned int i = P->getNumParams(); i > 0; i--) {
    Hd = caml_alloc(1, Abstract_tag);
    *((const clang::ParmVarDecl **)Data_abstract_val(Hd)) = P->getParamDecl(i - 1);
    Tmp = caml_alloc(2, Abstract_tag);
    Store_field(Tmp, 0, Hd);
    Store_field(Tmp, 1, Tl);
    Tl = Tmp;
  }
  CAMLreturn(Tl);
}
(* OCaml *)
external get_params : t -> Param.t list = "clang_function_decl_get_params"
let params = get_params decl in
List.iter (....) params

Is there anything I am missing here?

Why do you use Abstract_tag = 251? If you intend to create a cons cell, then the tag should be 0.
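To see why the tag matters, here is a toy model in C++. It is not the OCaml runtime: `Block`, `mark` and `reachable_blocks` are made-up names, and the collector is reduced to a bare mark phase. The only facts taken from the real runtime are that list cells use tag 0 and that blocks whose tag is at least No_scan_tag (251, the same value as Abstract_tag) are treated as opaque data whose fields are never followed.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Numeric values match the real runtime; everything else is a simplified model.
constexpr int kConsTag = 0;
constexpr int kAbstractTag = 251;
constexpr int kNoScanTag = 251;  // tags >= this are treated as opaque bytes

struct Block {
  int tag;
  std::vector<Block*> fields;  // nullptr models an immediate such as Val_int(0)
};

// Mark phase: follow fields only for blocks whose tag is below kNoScanTag.
void mark(const Block* b, std::unordered_set<const Block*>& live) {
  if (b == nullptr || !live.insert(b).second) return;
  if (b->tag >= kNoScanTag) return;  // opaque block: fields are never inspected
  for (const Block* f : b->fields) mark(f, live);
}

// Build a list of n elements, using cell_tag for every list cell, and return
// how many blocks the toy collector considers reachable from the list head.
std::size_t reachable_blocks(int n, int cell_tag) {
  assert(n >= 0);
  std::vector<Block> heap;
  heap.reserve(2 * static_cast<std::size_t>(n));  // no reallocation: pointers stay valid
  Block* tail = nullptr;
  for (int i = 0; i < n; ++i) {
    heap.push_back(Block{kAbstractTag, {}});      // element wrapping a C++ pointer
    Block* hd = &heap.back();
    heap.push_back(Block{cell_tag, {hd, tail}});  // list cell (hd, tl)
    tail = &heap.back();
  }
  std::unordered_set<const Block*> live;
  mark(tail, live);
  return live.size();
}
```

In this model, `reachable_blocks(3, kConsTag)` counts all six blocks, while `reachable_blocks(3, kAbstractTag)` counts only the first cell. The elements and the rest of the list look unreachable, which is consistent with the crash seen while iterating.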

Indeed. That solves the problem. Thanks a lot!

One more question.
How does the OCaml GC compute heap reachability?
For example, suppose some OCaml code acquires a pointer to an instance of a C++ class. Does the GC preserve all of its fields, including integers, objects, etc.?
My OCaml program acquires a pointer p. But it seems that p->a.x is released by the GC unexpectedly.

The OCaml GC only keeps track of OCaml values. If you store an arbitrary pointer in an OCaml block (by using Abstract_tag or some other encoding) it will not be traced by the GC at all.

Cheers,
Nicolas

Shameless plug: did you consider using clangml? There are Clang.ext_function_decl_get_num_params and Clang.ext_function_decl_get_param_decl in the bindings, and more high level functions to access Clang’s AST from OCaml.


@nojb Thanks a lot. So it means that without any additional mechanism, p->a.x will be garbage collected.

@thierry-martinez I am actually a big fan of clangml and actively using it. I appreciate it. Currently, I am trying to understand Clang more deeply, as well as OCaml-C interfacing. How does clangml keep everything safe from the GC? For example, if your OCaml program only holds a pointer to a function declaration or a translation unit at some point, how do you prevent the GC from reclaiming other sub-nodes in the AST such as params, types, etc. (as they do not have OCaml values at this moment)?

You would need to explain what p->a.x is in order to answer this question with any precision, but yes, in general the GC is unable to keep track of values which are stored in data structures allocated outside of the OCaml heap. In that case, you typically need to register the OCaml pointer explicitly with the GC to keep it from being garbage collected; see the "Interfacing C with OCaml" chapter of the OCaml manual (specifically the caml_register_global_root and caml_register_generational_global_root functions).

Cheers,
Nicolas

The C interface of clangml uses tuples (OCaml blocks) to store dependencies: for instance, values of opaque type Clang.cxcursor are represented in memory as pairs (ptr, tu) where ptr is a custom block storing a pointer to libclang’s CXCursor, and tu is the Clang.cxtranslationunit which the cxcursor belongs to. Therefore, from the point of view of OCaml’s GC, as long as there is a reachable cxcursor, the underlying cxtranslationunit is reachable. The same mechanism is used for all other values (e.g., cxtype, cxsourcerange, …) that depend on a given cxtranslationunit, and for cxtranslationunit values themselves, which refer to their underlying cxindex.

@thierry-martinez Thanks a lot. That part I understand. But my question is about the following case:
Suppose a TU has multiple ASTs and, at some moment, my OCaml program only holds one pair (ptr, tu) where ptr is a pointer to one of the ASTs. In this case, the OCaml GC does not know about the reachability established on the C side (i.e., that all the other ASTs are reachable from the TU). Can the GC then erase the other AST nodes?

The OCaml representation of the other AST nodes will be collected, but the C representation is stored by the translation unit and is kept as long as the translation unit is reachable: that is necessary since we can still access the other nodes, and even rebuild their OCaml representation, from the reference to the translation unit.
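As an analogy only, the same ownership shape can be sketched in plain C++. This is not clangml's implementation: `TranslationUnit`, `Cursor`, `sibling` and `demo` are hypothetical names, and `std::shared_ptr` stands in for the OCaml GC keeping the tu component of a (ptr, tu) pair alive.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a translation unit: it owns every AST node,
// the way Clang (not the OCaml heap) owns the real nodes.
struct TranslationUnit {
  std::vector<std::string> nodes;  // node storage lives and dies with the TU
};

// Hypothetical stand-in for the (ptr, tu) pair.
struct Cursor {
  std::size_t index;                    // plays the role of the opaque pointer
  std::shared_ptr<TranslationUnit> tu;  // strong reference keeping the TU alive
  const std::string& name() const { return tu->nodes.at(index); }
};

// Any node can be reached again through the TU held by an existing cursor.
Cursor sibling(const Cursor& c, std::size_t index) {
  assert(c.tu != nullptr);
  return Cursor{index, c.tu};
}

std::string demo() {
  auto tu = std::make_shared<TranslationUnit>();
  tu->nodes = {"f", "param_x", "int"};
  Cursor only{0, tu};
  tu.reset();  // the program drops its direct handle to the TU
  // Nodes 1 and 2 never had a handle of their own, yet they are still there,
  // because the single remaining cursor keeps the whole TU alive.
  return sibling(only, 1).name() + "," + sibling(only, 2).name();
}
```

Here `demo()` returns "param_x,int": the nodes without handles are owned by the translation unit, so they survive for as long as any cursor still refers to it.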

Thanks. Then, I am a bit confused and wondering how I should understand this comment from @nojb.

  1. Clang AST nodes are allocated by Clang (C program).
  2. The OCaml GC does not understand reachability made by C pointers (i.e., TU → function → stmt → type etc…).
  3. My OCaml program only has a handle to the TU.

How does 3 prevent the GC from erasing all the other AST nodes?

The OCaml GC only takes care of the OCaml heap. Clang AST nodes are allocated (by clang!) outside the OCaml heap, and the C pointers stored in OCaml data structures are in custom blocks, which are opaque to the GC.

Regarding OCaml pointers stored in foreign data structures, a very common pattern (according to my local checkout of the opam repo) is to do malloc(sizeof(value)), register the pointer as a generational global root, store the value inside it, and store the pointer in the data structure. This is maybe what @nojb was referring to.
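The shape of that pattern can be sketched with a toy root table in plain C++. Everything here is hypothetical (`Value`, `register_root`, `remove_root`, `Callback`, `survives`); in real stubs the registration and removal steps would be calls to caml_register_generational_global_root and caml_remove_generational_global_root on a `value *`.

```cpp
#include <cassert>
#include <cstdlib>
#include <unordered_set>

struct Block {};
using Value = Block*;  // toy value: just a pointer to a heap block

// Toy root table standing in for the runtime's set of global roots.
std::unordered_set<Value*> g_roots;
void register_root(Value* slot) { g_roots.insert(slot); }
void remove_root(Value* slot) { g_roots.erase(slot); }

// Hypothetical foreign (C++) structure that needs to hold on to a value.
struct Callback {
  Value* slot;  // malloc'ed cell whose address is registered as a root
};

Callback make_callback(Value v) {
  Value* slot = static_cast<Value*>(std::malloc(sizeof(Value)));
  if (slot == nullptr) std::abort();
  *slot = v;            // store the value inside the cell
  register_root(slot);  // tell the collector about the cell
  return Callback{slot};
}

void destroy_callback(Callback& cb) {
  assert(cb.slot != nullptr);
  remove_root(cb.slot);  // unregister before freeing the cell
  std::free(cb.slot);
  cb.slot = nullptr;
}

// Toy collection rule: a block survives only if a registered cell points to it.
bool survives(const Block* b) {
  for (Value* slot : g_roots) {
    if (*slot == b) return true;
  }
  return false;
}
```

A block passed to `make_callback` survives in this model until `destroy_callback` runs; after that nothing keeps it alive. The order in the destructor matters: the cell is removed from the root table before it is freed, so the collector never reads freed memory.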

We have a library for this use case: ocaml-boxroot (ocaml-rust / ocaml-boxroot on GitLab). Functionally it works like the above, but in a much more efficient way. (You can even use it to get rid of CAMLparam, CAMLlocal and CAMLreturn!) If there’s popular demand from C++ folks, I might write a smart pointer for it.


I understand. Thanks all!