[ANN] clangml 4.2.0: OCaml bindings for Clang API (for C and C++ parsing)

Dear OCaml users,

We are happy to announce the new clangml 4.2.0 release. Clangml provides bindings for all versions of Clang, from 3.4 to the not yet released 10.0.1.

The library can be installed via opam: opam install clangml
The documentation is online: https://memcad.gitlabpages.inria.fr/clangml/

This new release improves C++ support, including C++20 specific constructs.

All Clang C/C++ attributes should now be supported. You may have a look at the interface of the new auto-generated module Attributes.

There is now a lazy version of the AST (Clang.Lazy.Ast): this is useful to explore large ASTs efficiently (note that Clang parsing itself can still be slow; the lazy part only concerns the conversion into the Clang.Lazy.Ast datatypes).

Happy hacking!

This could be useful for automatically generating OCaml bindings from parsed C/C++ headers, like one of the solutions listed in Advanced C binding using ocaml-ctypes and dune. The closest model would be Python’s ctypeslib.

Impressive, now people will start to analyze C++ code.
Must be tough.

Automated bindings generation is indeed one of the motivations for this work, and the bindings to the Clang API themselves are auto-generated that way (but the generators are still quite ad hoc).

I also started experimenting a couple of months ago with a tool/library that uses clangml to generate C bindings from C headers. (At that time this uncovered a couple of bugs in clangml, and @thierry-martinez was very quick in fixing them, so many thanks to him.)

My use case was to generate bindings for a library with a large API surface (many functions, many public structures with many fields), that evolves over time, but which follows consistently a set of conventions through its API. Furthermore, the library manages memory in a way that is quite convenient for FFI bindings (no ownership transfer, the library manages its own memory and only produces handles (structure pointers) to memory it manages itself—the lifetime of data is less obvious but it’s less crucial for OCaml bindings).

Here’s a summary of some of the design constraints and observations that I can remember:

  • The generator must produce wrapper code that directly corresponds to an idiomatic OCaml API. The goal is to avoid manually going through the whole surface of the C API, so there is no point in e.g. generating a description of the library in some C-like OCaml eDSL (e.g. ctypes) if one then needs to manually write wrapper code on top of that.
  • Thanks to the fact that the C library implements some conventions and uniform memory discipline, it should be possible to write generator code that takes advantage of these conventions.
    For instance, the memory management discipline of the library means that I can build into my generator the assumption that, for any structure s_foo the library exposes, a value of type *s_foo can always be represented on the OCaml side as an integer value corresponding to the value of the pointer, and exposed as an abstract type s_foo (complemented with accessor functions for the fields of the structure), without having to copy data around.
  • So ideally, there would be a generator library providing relatively generic components helping process and iterate over the C headers AST, that one would then use to implement a generator tool that bakes in assumptions and knowledge about the implicit conventions of a specific library.
  • Generating wrapper code for structure accessors or enum declarations is automatable in a relatively generic way. Nevertheless, some wrapper code (specific to the given library) will have to be written by hand. I wrote wrapper code by hand for the callback system of the library I was looking at and their implementation of linked lists, then made these manual bindings available to my automated generator.
  • Even after writing all this code specific to the library's conventions, I’m not sure that the bindings generation can be completely push-button. In particular, it seems hard to account for the common pattern of functions returning their results through arguments. If a library exposes a function void rotate(int x, int y, int z, int* ox, int* oy, int* oz), are ox, oy and oz inputs of the function, or are they used as return values (in which case the generator should produce an OCaml function returning a tuple)? It’s quite obvious here that they are return values, but it seems hard to automate that without relying on brittle heuristics.
    Maybe that means that the user must specify by hand the “polarity” of arguments for every function of the API that uses some arguments as return values?
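
To make the “polarity” idea concrete, here is a minimal sketch of what a hand-written annotation table could look like and how a generator might turn it into an OCaml signature. Everything in it (the param record, the out_params table, the signature function) is hypothetical and not part of clangml; it also assumes that the C function returns void and that each output pointer maps to the OCaml type of its pointee.

```ocaml
(* Hypothetical view of a C parameter once its OCaml type has been chosen. *)
type param = { name : string; ocaml_type : string }

(* Hand-written polarity table (an assumption of this sketch):
   C function name -> names of the parameters used as outputs. *)
let out_params = [ ("rotate", [ "ox"; "oy"; "oz" ]) ]

(* A parameter is an output only if the user listed it for that function. *)
let is_out fn p =
  match List.assoc_opt fn out_params with
  | Some names -> List.mem p.name names
  | None -> false

(* Render the OCaml signature of the wrapper: inputs become arguments,
   outputs become the components of the returned tuple. *)
let signature fn params =
  let outs, ins = List.partition (is_out fn) params in
  let types ps = List.map (fun p -> p.ocaml_type) ps in
  let args = match ins with [] -> [ "unit" ] | _ -> types ins in
  let ret =
    match outs with [] -> "unit" | _ -> String.concat " * " (types outs)
  in
  Printf.sprintf "val %s : %s -> %s" fn (String.concat " -> " args) ret

let rotate_params =
  List.map
    (fun name -> { name; ocaml_type = "int" })
    [ "x"; "y"; "z"; "ox"; "oy"; "oz" ]

let () = print_endline (signature "rotate" rotate_params)
```

Functions that are absent from the table keep all their parameters as inputs, so the table only has to mention the part of the API that returns results through arguments.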

To wrap up, I think it’s an interesting approach, certainly more viable than writing the bindings by hand (even with ctypes) as I started doing, and I’m very glad that clangml exists so that we can experiment with it.
What I found hard was to draw a line between components of the bindings generator library that should be reusable, and parts that were making assumptions about conventions followed by the specific C library I was looking at.

I will try to release the code and write a blog post about it at some point.

I was also aware of the bindgen tool for rust, but not python’s ctypeslib, so that’s something that I should indeed check out.

Clangml looks amazing! I have been thinking about what it would take to migrate infer’s current clang frontend to clangml. Currently, infer parses C/C++/Objective-C into OCaml with a clang plugin written in C++ that dumps the clang AST into biniou by running clang with the plugin attached. Infer then deserialises the biniou data into an OCaml data structure. It would be amazing to replace the plugin by bindings to clangml!

After playing around with it for a bit, I didn’t find much in clangml about how to go from a clang command to something that clangml can parse. Also, the Clang.Command_line module seems to support only a few options. Infer takes clang driver commands as input and has to parse the C/C++/Objective-C files that each command would compile if run. The command-line flags are typically needed to compile these files, and I would guess to parse them too (in particular preprocessor options like -Dfoo=bar, various -I, -isystem, …). Is there a way to pass these flags to clangml somehow? Let me know if I should post this question on the gitlab instead.

Not sure if it completely answers your question, but I’m doing something like:

let ast =
  Clang.Ast.parse_file
    ~command_line_args:(
      [Clang.Command_line.include_directory Clang.includedir] @
      ["-DFOO"; "-Ibar/baz"]
    )
    filename

D’uh! Ok, thanks, I’ll give it another try.

One thing that’s not obvious is how to extract the file name from a clang command. In fact, a single clang driver command can compile several files. clang -### <original arguments> will output one -cc1 command for each file, and usually the filename is the last argument but I’ve seen cases where that assumption was violated.
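
As an illustration of the problem, here is a rough sketch that separates source files from the other arguments of a command line without relying on the position of the file name. It is not part of clangml or infer: split_command, the list of flags that take a separate value, and the list of source extensions are all assumptions of this sketch, and the extension test is itself a heuristic (it would miss, for example, a file with an unusual extension compiled with -x c).

```ocaml
(* Extensions treated as C/C++/Objective-C sources (assumed, partial list). *)
let source_extensions = [ ".c"; ".cc"; ".cpp"; ".cxx"; ".m"; ".mm" ]

(* Flags whose value is given in the following argument (assumed, partial). *)
let takes_value = [ "-I"; "-D"; "-U"; "-isystem"; "-include"; "-o"; "-x" ]

(* An argument is considered a source file if it is not a flag and ends
   with one of the known extensions. *)
let is_source arg =
  String.length arg > 0
  && arg.[0] <> '-'
  && List.exists (Filename.check_suffix arg) source_extensions

(* Split a list of arguments into (other arguments, source files),
   preserving the original order in both lists. *)
let split_command args =
  let rec go flags sources = function
    | [] -> (List.rev flags, List.rev sources)
    | flag :: value :: rest when List.mem flag takes_value ->
        (* Keep the value with its flag, even if it looks like a source. *)
        go (value :: flag :: flags) sources rest
    | arg :: rest when is_source arg -> go flags (arg :: sources) rest
    | arg :: rest -> go (arg :: flags) sources rest
  in
  go [] [] args

let flags, sources =
  split_command
    [ "-DFOO=bar"; "-I"; "include"; "a.c"; "-c"; "b.cpp"; "-o"; "out.o" ]
```

With this input, sources contains a.c and b.cpp, and everything else stays in flags. Flags that only make sense for compilation (such as -c or -o) would probably still need to be filtered out before handing the rest to a parser.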

Is it possible to retain comments in the AST? From the documentation it seems that currently ‘comment nodes’ are not supported. I tried to pass the --comments flag to the parse function, but the comments are gone.

That’s something that I needed as well, and it is indeed possible! Digging through my email, this is the solution that @thierry-martinez provided me at that time:

  1. One can use Clang.cursor_get_raw_comment_text to retrieve doxygen comments.
let example = {|
struct foo {
  /** bla */
  int bla;
};
|}

let () =
  let ast = Clang.Ast.parse_string example in
  let bla =
    match ast with
    | { desc = { items = [
        { desc = RecordDecl { fields = [bla]; _ }; _ }]; _ }; _ } ->
        bla
    | _ -> assert false in
  assert
    (Clang.cursor_get_raw_comment_text (Clang.Ast.cursor_of_node bla)
      = "/** bla */")
  2. To get non-doxygen comments, one has to use a lower level API that talks to the lexer directly:
let example = {|
struct foo {
  int bla; // comment
  int bar;
  int tux; // comment2
    // ctd
};
|}

let range_start_of_node node =
  Clang.get_range_start (Clang.get_cursor_extent
    (Clang.Ast.cursor_of_node node))

let range_end_of_node node =
  Clang.get_range_end (Clang.get_cursor_extent (Clang.Ast.cursor_of_node node))

let comment_of_range (tu : Clang.cxtranslationunit)
    (range : Clang.cxsourcerange) : string list =
  let rec aux accu tokens =
    match tokens with
    | [] -> accu
    | hd :: tl ->
        match Clang.get_token_kind hd with
        | Punctuation -> aux accu tl
        | Comment -> aux (Clang.get_token_spelling tu hd :: accu) tl
        | _ -> accu in
  List.rev (aux [] (Array.to_list (Clang.tokenize tu range)))

let fields_with_comment (record_decl : Clang.Decl.t)
    : (Clang.Decl.t * string list) list =
  let fields =
    match record_decl with
    | { desc = RecordDecl { fields; _ }; _ } -> fields
    | _ -> invalid_arg "fields_with_comment" in
  match fields with
  | [] -> []
  | hd :: tl ->
      let tu =
        Clang.cursor_get_translation_unit
          (Clang.Ast.cursor_of_node record_decl) in
      let record_decl_end = range_end_of_node record_decl in
      let rec aux accu first others =
        let first_end = range_end_of_node first in
        match others with
        | [] ->
            let comment =
              comment_of_range tu (Clang.get_range first_end record_decl_end) in
            List.rev ((first, comment) :: accu)
        | hd :: tl ->
            let hd_start = range_start_of_node hd in
            let comment =
              comment_of_range tu (Clang.get_range first_end hd_start) in
            aux ((first, comment) :: accu) hd tl in
      aux [] hd tl

let () =
  let ast = Clang.Ast.parse_string example in
  let foo =
    match ast with
    | { desc = { items = [foo]; _ }; _ } -> foo
    | _ -> assert false in
  match fields_with_comment foo with
  | [({ desc = Field { name = "bla"; _ }; _ }, ["// comment"]);
      ({ desc = Field { name = "bar"; _ }; _ }, []);
      ({ desc = Field { name = "tux"; _ }; _ }, ["// comment2"; "// ctd"])] ->
        ()
  | _ -> assert false

Thanks for the fast response, it will be very useful!

Does anyone know how to analyse _Atomic types ?

I am trying to read a declaration such as
typedef _Atomic _Bool atomic_bool;
But the ast node I get for the underlying type is
Clang.Ast.UnexposedType Clang.Atomic
with apparently no way to get back the ‘_Bool’ part.

Indeed, _Atomic types are not yet exposed, thanks for reporting this! I will add support for them in the next release. For the time being, you may use Clang.type_get_value_type to retrieve the cxtype of the underlying type:

let () =
  let ast =
    Clang.Ast.parse_string ~filename:"atomic.c" {|
      typedef _Atomic _Bool atomic_bool;
    |} in
  let atomic_type =
    match (List.hd ast.desc.items).desc with
    | TypedefDecl {
        name = "atomic_bool";
        underlying_type
      } -> underlying_type
    | _ -> assert false in
  let value_type =
    Clang.Type.of_cxtype (Clang.type_get_value_type atomic_type.cxtype) in
  assert (value_type.desc = BuiltinType Bool)

Clang.type_get_value_type does not exist on my side it seems.
Could it be related to the clang version I use? That is clang 10 (on ubuntu 20.04).

Oh yes, sorry, clangml’s continuous integration just made me realize that the underlying C function clang_Type_getValueType was introduced in Clang 11…

The snapshot version of clangml now contains compatibility code for destructing _Atomic types. You may use opam pin to install it:

opam pin add https://github.com/thierry-martinez/metapp.git
opam pin add https://github.com/thierry-martinez/metaquot.git
opam pin add https://gitlab.inria.fr/memcad/clangml.git#snapshot

I hope to make a release soon!

Thank you for the quick answer and the fix. It is really helpful already.

I still get some Clang.Ast.UnexposedType Clang.Atomic (together with the new Clang.Ast.Atomic node) on some examples though. I will probably open an issue on your gitlab if I find some time to craft some minimal example.

Thank you for the feedback! I think that the Clang.Ast.UnexposedType Clang.Atomic that remained were due to an unfortunate code duplication in clangml. I just fixed this code duplication in the snapshot version: you can try opam reinstall clangml, which should update the pinned version.

If the problem is still there, some code examples would be useful.

Ok, now that I can parse atomic types, I am stuck on atomic expressions :slight_smile:.

Let’s say that I want to parse
atomic_load_explicit((uint64_t _Atomic const volatile *)addr,memory_order_relaxed);

So far I am doing something like that:

match e.desc with
| Clang.Ast.UnexposedExpr AtomicExpr ->
    let cx = Clang.Ast.cursor_of_node e in
    let args = List.map Clang.Expr.of_cxcursor (Clang.list_of_children cx) in
    (* do something with the arguments *)
    ignore args
| _ -> (* ... other cases ... *) ()

But, how can I decide which kind of atomic operation it is (store/load/etc)?

You can’t… except if you update to the snapshot version I just published. :wink:

Sorry for all these difficulties: these are corners of the language that have not been tested. Unfortunately, as far as I know, libclang does not expose anything to access the atomic operator, so there is no workaround for previous versions of clangml: you have to use the new snapshot version where I added an interface to the underlying C++ method…