[ANN] clangml 4.2.0: OCaml bindings for Clang API (for C and C++ parsing)

Dear OCaml users,

We are happy to announce the new clangml 4.2.0 release. Clangml provides bindings for all versions of Clang, from 3.4 to the not yet released 10.0.1.

The library can be installed via opam: opam install clangml
The documentation is online: https://memcad.gitlabpages.inria.fr/clangml/

This new release improves C++ support, including C++20 specific constructs.

All Clang C/C++ attributes should now be supported. You may have a look at the interface of the new auto-generated module Attributes.

There is now a lazy version of the AST (Clang.Lazy.Ast): this is useful to explore large ASTs efficiently (note that Clang parsing itself can still be slow; the lazy part only concerns the conversion into the Clang.Lazy.Ast datatypes).

Happy hacking!

This could be useful for automatically generating OCaml bindings by parsing C/C++ headers, like one of the solutions listed in Advanced C binding using ocaml-ctypes and dune. The closest model would be Python’s ctypeslib.

Impressive, now people will start to analyze C++ code.
Must be tough.

Automated bindings generation is indeed one of the motivations for this work, and the bindings to the Clang API themselves are auto-generated that way (but the generators are still quite ad hoc).

I also started experimenting a couple of months ago with a tool/library that uses clangml to generate C bindings from C headers. (At that time, this uncovered a couple of bugs in clangml, and @thierry-martinez was very quick to fix them, so many thanks to him.)

My use case was to generate bindings for a library with a large API surface (many functions, many public structures with many fields) that evolves over time but consistently follows a set of conventions throughout its API. Furthermore, the library manages memory in a way that is quite convenient for FFI bindings: there is no ownership transfer, and the library only hands out handles (structure pointers) to memory that it manages itself. The lifetime of data is less obvious, but that is less crucial for OCaml bindings.

Here’s a summary of some of the design constraints and observations that I can remember:

  • The generator must produce wrapper code that directly corresponds to an idiomatic OCaml API. The goal is to avoid manually going through the whole surface of the C API, so there is no point in e.g. generating a description of the library in some C-like OCaml eDSL (e.g. ctypes) if one then needs to manually write wrapper code on top of that.
  • Thanks to the fact that the C library implements some conventions and uniform memory discipline, it should be possible to write generator code that takes advantage of these conventions.
    For instance, the memory management discipline of the library means that I can build into my generator the assumption that, for any structure s_foo the library exposes, a value of type *s_foo can always be represented on the OCaml side as an integer value corresponding to the value of the pointer, and exposed as an abstract type s_foo (complemented with accessor functions for the fields of the structure), without having to copy data around.
  • So ideally, there would be a generator library providing relatively generic components helping process and iterate over the C headers AST, that one would then use to implement a generator tool that bakes in assumptions and knowledge about the implicit conventions of a specific library.
  • Generating wrapper code for structure accessors or enum declarations is automatable in a relatively generic way. Nevertheless, some wrapper code (specific to the given library) will have to be written by hand. I wrote wrapper code by hand for the callback system of the library I was looking at and their implementation of linked lists, then made these manual bindings available to my automated generator.
  • Even after writing all this library-conventions-specific code, I’m not sure that the bindings generation can be completely push-button. In particular, it seems hard to account for the common pattern of functions returning their results through arguments. If a library exposes a function void rotate(int x, int y, int z, int* ox, int* oy, int* oz), are ox, oy and oz arguments of the function, or are they used as return values (in which case the generator should produce an OCaml function returning a tuple)? It’s quite obvious here that they are return values, but it seems hard to automate that without relying on brittle heuristics.
    Maybe that means that the user must specify by hand the “polarity” of arguments for every function of the API that uses some arguments as return values?
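
To make the “polarity” idea concrete, here is a minimal sketch of what a hand-written annotation could look like and how a generator might turn it into an OCaml signature. The types polarity and param and the function wrapper_signature are hypothetical names made up for this illustration; they are not part of clangml or of the tool described above.

```ocaml
(* Hypothetical description of a C parameter, annotated by hand with its
   polarity. None of these names come from clangml. *)
type polarity = In | Out

type param = { name : string; ocaml_type : string; polarity : polarity }

(* Build the OCaml signature of the wrapper: [In] parameters become
   labelled arguments, [Out] parameters become components of the result. *)
let wrapper_signature (fname : string) (params : param list) : string =
  let ins =
    List.filter_map
      (fun p ->
        match p.polarity with
        | In -> Some (Printf.sprintf "%s:%s" p.name p.ocaml_type)
        | Out -> None)
      params in
  let outs =
    List.filter_map
      (fun p ->
        match p.polarity with
        | Out -> Some p.ocaml_type
        | In -> None)
      params in
  let args = match ins with [] -> ["unit"] | _ -> ins in
  let result =
    match outs with [] -> "unit" | _ -> String.concat " * " outs in
  Printf.sprintf "val %s : %s -> %s" fname (String.concat " -> " args) result

(* The rotate example from the post, with ox, oy and oz marked as outputs. *)
let rotate_signature =
  wrapper_signature "rotate"
    [ { name = "x"; ocaml_type = "int"; polarity = In };
      { name = "y"; ocaml_type = "int"; polarity = In };
      { name = "z"; ocaml_type = "int"; polarity = In };
      { name = "ox"; ocaml_type = "int"; polarity = Out };
      { name = "oy"; ocaml_type = "int"; polarity = Out };
      { name = "oz"; ocaml_type = "int"; polarity = Out } ]
```

With these annotations, rotate_signature is a signature taking x, y and z as labelled arguments and returning a triple of integers. The annotation still has to be supplied by hand for each such function; the sketch only shows that, once it is available, the rest is mechanical.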

To wrap up, I think it’s an interesting approach, certainly more viable than writing the bindings by hand (even with ctypes) as I started doing, and I’m very glad that clangml exists so that we can experiment with it.
What I found hard was to draw a line between components of the bindings generator library that should be reusable, and parts that were making assumptions about conventions followed by the specific C library I was looking at.
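
One possible way to draw that line, sketched here purely as an illustration, is to put the library-specific knowledge behind a module type and keep the generic generator as a functor over it. The names CONVENTIONS, Make, Foo_conventions and the _get_ accessor naming scheme are all assumptions invented for this example, not the design of the actual tool.

```ocaml
(* Library-specific knowledge: everything behind this signature is an
   assumption about the conventions of one particular C library. *)
module type CONVENTIONS = sig
  (* OCaml name of the abstract type standing for a pointer to a struct. *)
  val ocaml_type_of_struct : string -> string

  (* OCaml name of the accessor for a given field of a given struct. *)
  val accessor_name : struct_name:string -> field:string -> string
end

(* Generic part: emits accessor signatures for any struct description,
   given as a list of (field name, OCaml type) pairs. *)
module Make (C : CONVENTIONS) = struct
  let accessors ~struct_name (fields : (string * string) list) : string list =
    let t = C.ocaml_type_of_struct struct_name in
    List.map
      (fun (field, ty) ->
        Printf.sprintf "val %s : %s -> %s"
          (C.accessor_name ~struct_name ~field) t ty)
      fields
end

(* Hypothetical conventions: the struct name is reused as the abstract
   type name, and accessors are called <struct>_get_<field>. *)
module Foo_conventions = struct
  let ocaml_type_of_struct s = s
  let accessor_name ~struct_name ~field = struct_name ^ "_get_" ^ field
end

module Foo_generator = Make (Foo_conventions)

let foo_accessors =
  Foo_generator.accessors ~struct_name:"s_foo" ["bla", "int"; "bar", "int"]
```

The hard part mentioned above remains: deciding which functions belong in the CONVENTIONS signature at all. The functor only makes the boundary explicit once that decision has been made.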

I will try to release the code and write a blog post about it at some point.

I was also aware of the bindgen tool for Rust, but not Python’s ctypeslib, so that’s something that I should indeed check out.

Clangml looks amazing! I have been thinking about what it would take to migrate infer’s current clang frontend to clangml. Currently, infer parses C/C++/Objective-C into OCaml with a clang plugin written in C++ that dumps the clang AST into biniou by running clang with the plugin attached. Infer then deserialises the biniou data into an OCaml data structure. It would be amazing to replace the plugin by bindings to clangml!

After playing around with it for a bit, I didn’t find much in clangml about how to go from a clang command to something that can be given to clangml to get a parsed file. Also, the Clang.Command_line module seems to support only a few options. Infer takes clang driver commands as input and has to parse the C/C++/Objective-C files that each command would compile if run. The clang command-line flags are typically needed to compile these files, and I would guess to parse them too (in particular, preprocessor options like -Dfoo=bar, various -I, -isystem, …). Is there a way to pass these flags to clangml somehow? Let me know if I should post this question on the gitlab instead.

Not sure if it completely answers your question, but I’m doing something like:

let ast =
  Clang.Ast.parse_file
    ~command_line_args:(
      [Clang.Command_line.include_directory Clang.includedir] @
      ["-DFOO"; "-Ibar/baz"]
    )
    filename

D’uh! Ok, thanks, I’ll give it another try.

One thing that’s not obvious is how to extract the file name from a clang command. In fact, a single clang driver command can compile several files. clang -### <original arguments> will output one -cc1 command for each file, and usually the filename is the last argument but I’ve seen cases where that assumption was violated.
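
As a rough illustration of avoiding the “last argument” assumption, the sketch below scans an already-tokenized argument list for arguments that look like source files, skipping the values of options that take a separate argument. The lists source_extensions and options_with_value are assumptions and certainly incomplete, and the sketch does not handle the shell-style quoting that clang -### uses in its output; it is a heuristic, not a faithful model of the clang option parser.

```ocaml
(* Extensions treated as source files: an assumption, extend as needed. *)
let source_extensions = [".c"; ".cc"; ".cpp"; ".cxx"; ".m"; ".mm"]

(* Options whose next argument is a value and not an input file.
   This list is an assumption and is far from complete. *)
let options_with_value =
  ["-o"; "-I"; "-isystem"; "-include"; "-D"; "-x"; "-main-file-name"]

(* Return every argument that looks like a source file, wherever it
   appears in the argument list. *)
let source_files (args : string list) : string list =
  let is_source arg =
    String.length arg > 0
    && arg.[0] <> '-'
    && List.mem (Filename.extension arg) source_extensions in
  let rec aux accu = function
    | [] -> List.rev accu
    | opt :: _ :: tl when List.mem opt options_with_value ->
        (* Skip the option together with its value. *)
        aux accu tl
    | arg :: tl -> aux (if is_source arg then arg :: accu else accu) tl in
  aux [] args

(* Hypothetical argument list where the source file is not last. *)
let example_sources =
  source_files
    ["-cc1"; "-main-file-name"; "foo.c"; "-I"; "inc"; "-o"; "foo.o";
     "-x"; "c"; "src/foo.c"; "-DFOO"]
```

Here example_sources contains only src/foo.c: the value of -main-file-name and the output file are skipped. Any option with a separate value that is missing from options_with_value could still produce a false positive, so the list would need to be checked against the commands actually encountered.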

Is it possible to retain comments in the AST? From the documentation it seems that currently ‘comment nodes’ are not supported. I tried to pass the --comments flag to the parse function, but the comments are gone.

That’s something that I needed as well, and it is indeed possible! Digging through my email, this is the solution that @thierry-martinez provided me at that time:

  1. One can use Clang.cursor_get_raw_comment_text to retrieve Doxygen comments.
let example = {|
struct foo {
  /** bla */
  int bla;
};
|}

let () =
  let ast = Clang.Ast.parse_string example in
  let bla =
    match ast with
    | { desc = { items = [
        { desc = RecordDecl { fields = [bla]; _ }; _ }]; _ }; _ } ->
        bla
    | _ -> assert false in
  assert
    (Clang.cursor_get_raw_comment_text (Clang.Ast.cursor_of_node bla)
      = "/** bla */")
  2. To get non-Doxygen comments, one has to use a lower-level API that talks to the lexer directly:
let example = {|
struct foo {
  int bla; // comment
  int bar;
  int tux; // comment2
    // ctd
};
|}

let range_start_of_node node =
  Clang.get_range_start (Clang.get_cursor_extent
    (Clang.Ast.cursor_of_node node))

let range_end_of_node node =
  Clang.get_range_end (Clang.get_cursor_extent (Clang.Ast.cursor_of_node node))

let comment_of_range (tu : Clang.cxtranslationunit)
    (range : Clang.cxsourcerange) : string list =
  let rec aux accu tokens =
    match tokens with
    | [] -> accu
    | hd :: tl ->
        match Clang.get_token_kind hd with
        | Punctuation -> aux accu tl
        | Comment -> aux (Clang.get_token_spelling tu hd :: accu) tl
        | _ -> accu in
  List.rev (aux [] (Array.to_list (Clang.tokenize tu range)))

let fields_with_comment (record_decl : Clang.Decl.t)
    : (Clang.Decl.t * string list) list =
  let fields =
    match record_decl with
    | { desc = RecordDecl { fields; _ }; _ } -> fields
    | _ -> invalid_arg "fields_with_comment" in
  match fields with
  | [] -> []
  | hd :: tl ->
      let tu =
        Clang.cursor_get_translation_unit
          (Clang.Ast.cursor_of_node record_decl) in
      let record_decl_end = range_end_of_node record_decl in
      let rec aux accu first others =
        let first_end = range_end_of_node first in
        match others with
        | [] ->
            let comment =
              comment_of_range tu (Clang.get_range first_end record_decl_end) in
            List.rev ((first, comment) :: accu)
        | hd :: tl ->
            let hd_start = range_start_of_node hd in
            let comment =
              comment_of_range tu (Clang.get_range first_end hd_start) in
            aux ((first, comment) :: accu) hd tl in
      aux [] hd tl

let () =
  let ast = Clang.Ast.parse_string example in
  let foo =
    match ast with
    | { desc = { items = [foo]; _ }; _ } -> foo
    | _ -> assert false in
  match fields_with_comment foo with
  | [({ desc = Field { name = "bla"; _ }; _ }, ["// comment"]);
      ({ desc = Field { name = "bar"; _ }; _ }, []);
      ({ desc = Field { name = "tux"; _ }; _ }, ["// comment2"; "// ctd"])] ->
        ()
  | _ -> assert false

Thanks for the fast response, it will be very useful!