Critique my use of FFI

hyphenrf · May 20, 2023, 11:32pm

I’ve read this post just now and it made me wonder how much better/worse we stand next to Go in terms of FFI calls and finalisers. I’ve ported the provided benchmark code to OCaml and did some measurements. I’ll try my best to post code snippets that are as focused as possible, instead of dumping everything here and creating too much noise. But first, numbers!

test	ocaml 5.0	go 1.19.9
native add	0.3 ns	0.3 ns
foreign add	1.5 ns	60.4 ns
c allocate & free	4.0 ns	71.6 ns
allocate ∘ free	17.2 ns	133.3 ns
alloc auto	125.6 ns	1180 ns
alloc dummy	131.0 ns	1197 ns
alloc custom	109.4 ns	-

I benchmarked by passing each function to a bench function that ran it in a loop, for _ = 1 to 1M do ignore (f ()) done, between two Unix.gettimeofday calls. Multiplying the elapsed seconds by 1000.0 then gives nanoseconds per iteration (seconds × 1e9 / 1e6 iterations).
For foreign/native add I inlined the addition operation in the loop.

It looks like we’re observing a similar slowdown with finalisers, roughly an order of magnitude. It also looks like OCaml’s C calls put Go’s to shame: about an order of magnitude faster in every case but the native one!

The two approaches used in this blog post were manual memory management and finalisers. I’ve additionally added custom blocks to the mix, since that’s usually something you find used with abstract types e.g. in LLVM bindings.

For Go’s Cstr representation I opted for the following in module Cstr:

type t = { pointer : ptr } [@@boxed]
 and ptr = private int

Then used untagged annotation for the external function declarations:

external alloc_ptr : unit -> (ptr [@untagged]) = "" "Alloc"
external free_ptr : (ptr [@untagged]) -> unit = "" "Free"

For the Addition FFI function, I used untagged as well as noalloc. For the finalization hooks, I wrote the following:

let mk_alloc fin () =
  let answer = { pointer = alloc_ptr() } in
  Gc.finalise fin answer;
  answer

let free {pointer} = free_ptr pointer

let alloc = mk_alloc free
let alloc_dummy = mk_alloc ignore

I found no equivalent of Go’s runtime.KeepAlive and honestly, unless I’m mistaken, it’s probably irrelevant in our case.

Now let’s take a look at the stub… I want to preface this by saying I know I should be using Ctypes and co. for my FFI needs, but I wanted to remain apples-to-apples as much as possible with Go.

For Addition, I just used int. The general advice seems to be to always prefer intnat so as not to cause truncation problems, but I wanted to remain as faithful as possible to the Go code, including potential overflows!

int Addition(int a, int b) { return a + b; }

Alloc and Free went from void to char* and back. I haven’t used the CAMLparam*/CAMLreturn* macros for them because I wasn’t touching the OCaml GC. That’s the impression I got from the manual about when to use them.

void Free(char *p) { free(p); }

As for the final custom block approach, I defined the struct relying on the fact that static data is zeroed by default:

static const struct custom_operations custom = {
        .identifier = "org.ocaml.discuss.custom",
        .finalize = Free_custom,
};

With custom there was no reason to avoid substituting void and char* for a proper value type, and reflecting that on the OCaml side:

type custom
external alloc_custom : unit -> custom = "Alloc_custom"

but I’m not really sure if I did it right in the stub:

#define Custom(t, v) (*((t*)Data_custom_val(v)))

void Free_custom(value v) {
        free(Custom(char*, v));
}

value Alloc_custom(value _) {
        CAMLparam1(_);
        CAMLlocal1(v);
        v = caml_alloc_custom_mem(&custom, sizeof(char*), BYTES);
        Custom(char*, v) = Alloc();
        CAMLreturn(v);
}

That’s all! Thanks for reading. Please let me know if I made any rookie mistakes with the stubs or benchmark code.

yawaramin · May 21, 2023, 1:18am

You may actually want to use Sys.time instead, to get the difference in processor time and avoid catastrophic cancellation (see the Wikipedia article of that name).

silene · May 21, 2023, 6:28am

No. Catastrophic cancellation is an exact operation, which means that the output error is exactly the input error, hence “garbage in garbage out”. So, you need to evaluate how much garbage is inputted. In the case of Unix.gettimeofday, it is about one millionth of a second, which, after one million operations as done here, becomes one thousandth of a nanosecond, and thus completely irrelevant.
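The arithmetic above can be checked with a small C sketch. The function names are hypothetical, and the 1 µs resolution is simply the estimate quoted in this post, not a measured property of any particular system.

```c
/* Worst-case contribution of the clock's resolution to a per-iteration
   average, in nanoseconds. `resolution_s` is the assumed resolution in
   seconds; `iterations` is the number of loop iterations timed. */
static double per_iteration_error_ns(double resolution_s, double iterations)
{
    return resolution_s * 1e9 / iterations;
}

/* Convert a total elapsed time in seconds into nanoseconds per iteration. */
static double ns_per_iteration(double elapsed_s, double iterations)
{
    return elapsed_s * 1e9 / iterations;
}
```

With a resolution of 1e-6 s and 1e6 iterations, the first function gives about 0.001 ns, which is small next to the 0.3 ns to 131 ns figures in the table. The second function is the same conversion as multiplying the elapsed seconds by 1000.0 when the loop runs one million times.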

(To be fair, Unix.gettimeofday does suffer from a small defect. It is a wall time rather than a process time, which means that it also accounts for other processes, and thus requires the computer to be at rest.)

dbuenzli · May 21, 2023, 7:54am

Another larger defect is that it can go forward or back in time since your operating system may adjust time behind your back. You rather want to use a monotonic wall time clock such as the one provided by Mtime_clock for this kind of measurement.

nojb · May 21, 2023, 7:56am

No mistakes, but a couple of remarks for the benefit of interested readers:

In Alloc_custom, it is not required to register either the unit function parameter or the local variable v as GC roots (using CAMLparam*/CAMLlocal*), but it doesn’t hurt to do it. In general you only need to register a variable as a GC root if it is alive across an instruction that calls into the GC (e.g. an allocation). In the manual this “optimization” is not explained because it is easy to shoot yourself in the foot with it if you are not careful.
In the Go example, the pointer field is supposed to represent an arbitrary C pointer. Since ints have one bit less of precision than the word size, you will need to make use of the fact that pointers on most machines are aligned to right-shift them into an int, and left-shift them out of the int representation before using them.
Typically you will want to zero out the data pointer in your custom block after freeing it in Free_custom. That allows you to check and fail at runtime if you try to access the custom block after freeing it.
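The last remark can be illustrated with a plain-C sketch that does not use the OCaml runtime. The struct handle type and both function names are hypothetical stand-ins for the data stored in a custom block and the stubs that touch it.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the payload of a custom block:
   a single owned C pointer. */
struct handle {
    char *data;
};

/* Release the buffer and clear the pointer so later uses can be detected.
   Calling it twice is harmless because free(NULL) is a no-op. */
static void handle_release(struct handle *h)
{
    free(h->data);
    h->data = NULL;
}

/* Store a byte and return 0, or return -1 if the handle was released. */
static int handle_write(struct handle *h, size_t i, char c)
{
    if (h->data == NULL)
        return -1;              /* use after free caught at runtime */
    h->data[i] = c;
    return 0;
}
```

In a real stub, the failure branch would more likely raise an OCaml exception (for instance with caml_invalid_argument) than return an error code; that part is left out here to keep the sketch self-contained.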

Cheers,
Nicolas

silene · May 21, 2023, 8:55am

Nowadays, on Linux at least, there are only three cases where it might happen:

The administrator manually modified the system time during a bench, so your own fault.
The ntp server adjusted the clock at computer boot time or when going out of sleep, so not relevant when benchmarking.
The ntp server abruptly adjusted the clock (rather than gradually, e.g., for leap seconds or just plain drift) because it is completely wrong, but then so would be the monotonic clock, which means that the computer is not suitable for benchmarking.

(And in case someone wonders, the monotonic clock is also impacted by gradual ntp adjustments. It is only protected against abrupt ntp adjustments. So, while it will not go backward, it will just stop going forward until the drift is resolved.)

So, Unix.gettimeofday is fine for system-wide benchmarking, and it does not even incur a system call, so it can be called often.

dbuenzli · May 21, 2023, 12:10pm

Maybe you are right but that remains conditional on a lot of ifs. I think it’s better to nudge people into using the right tools to measure wall clock time on their computer regardless of the context – diffing the result of gettimeofday is not a good idea in general.

On Linux there is CLOCK_MONOTONIC_RAW.
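As a rough sketch of what timing against a monotonic clock looks like at the C level, assuming a POSIX system: the bench_ns harness and the forty_two workload are hypothetical names used only for illustration, and this is not the benchmark code from the original post.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Read a monotonic clock as seconds in a double.
   CLOCK_MONOTONIC is POSIX; on Linux, CLOCK_MONOTONIC_RAW could be
   substituted to also exclude gradual NTP adjustments. */
static double monotonic_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical harness: call f `iterations` times and return the
   average cost in nanoseconds per call. */
static double bench_ns(int (*f)(void), long iterations)
{
    volatile int sink = 0;      /* keeps the calls from being optimised away */
    double start = monotonic_seconds();
    for (long i = 0; i < iterations; i++)
        sink = f();
    double stop = monotonic_seconds();
    (void)sink;
    return (stop - start) * 1e9 / (double)iterations;
}

/* Trivial workload used only for illustration. */
static int forty_two(void) { return 42; }
```

Calling bench_ns(forty_two, 1000000) returns an average in nanoseconds per call; since the clock never steps backward, the result cannot be negative.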

hyphenrf · May 22, 2023, 1:44am

Thank you all for the critiques. A few things I wanna address:

Doesn’t the local variable v fit this requirement, and so I must register it?

Thank you for pointing this out. I omitted Alloc before because it felt noisy, but here it is:

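#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { BYTES = 100 };

char *Alloc(void) {
	char *x = malloc(BYTES);
	/* sanity check: malloc'd pointers are aligned, so the low bit is clear */
	assert(0 == ((uintptr_t)x & 1));
	return x;
}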

The [@untagged] annotations should properly take care of converting pointer values back and forth between OCaml and C. The only issue I envision here is if the pointer value has the MSB set, which would get lost upon shifting. I went with the int solution because the manual presents it as valid… It is probably safer to store them in Int64 or Nativeint, if only to rely on fewer assumptions.

I initially went with mirage-clock-unix, with Mclock, which gave near-identical results. So I did actually try to do things right, but changed it later to builtin functions to be able to share the two files, have people install just the ocaml compiler, run a command, and have an executable ready to play with.

What’s surprising to me is that the recommendation between Unix.gettimeofday and Sys.time is the latter! I remember reading the exact opposite advice a long time back, and I don’t remember the exact reasons, but it was something related to threading.

yawaramin · May 22, 2023, 4:30am

The reason I recommended Sys.time is because:

It doesn’t pull in a unix dependency
It’s process time so not impacted by other workloads on your system
Smaller values so relatively speaking measuring differences between the values yields more accurate results

But actually Daniel’s recommendation is more correct, you almost certainly want wall clock time, not processor time.

nojb · May 22, 2023, 6:01am

No; v is not live across an allocation point (as it is written to just after the allocation itself). One way of understanding the condition “live across an allocation point” is as follows: when variables such as v contain pointers, the blocks they point to may be moved each time the GC runs. In order for the variables to be updated accordingly and not be left dangling, they need to be registered with the GC by using the CAMLparam*/CAMLlocal* macros. However, if at the allocation point, a variable has not yet been initialized with a valid pointer, or if its value will be modified before its next use, then it does not need to be registered. This is the case for v: as the result of the allocation will be written to v, even if it had been initialized with a valid pointer previously (which is not the case here), there’s no point preserving its previous value, so it does not need to be registered.

Actually, what I said was slightly wrong: you don’t shift the pointer in any way to fit it into an int, you just twiddle its least significant bit (which is always zero for an aligned pointer), so there is no issue here.
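The two encodings discussed in this thread can be compared in a standalone C sketch. All four function names are hypothetical; the second pair mirrors the 2n + 1 arithmetic of integer tagging, which is my understanding of what a pointer passed through an [@untagged] int ends up stored as, and should be read as an assumption rather than a statement about the compiler's output.

```c
#include <stdint.h>
#include <stdlib.h>

/* Encoding 1: set the least significant bit so the word looks like a
   tagged OCaml integer. Requires at least 2-byte alignment. */
static intptr_t tag_low_bit(void *p)
{
    return (intptr_t)p | 1;
}

static void *untag_low_bit(intptr_t v)
{
    return (void *)(v & ~(intptr_t)1);
}

/* Encoding 2: treat the pointer as an integer n and compute 2n + 1.
   The top bit of n is shifted out, so this round-trips only when the
   two highest bits of the pointer are equal. */
static intptr_t tag_shift(void *p)
{
    return (intptr_t)(((uintptr_t)p << 1) | 1);
}

/* Relies on >> being an arithmetic shift for signed values, which is
   implementation-defined in C but holds on mainstream compilers. */
static void *untag_shift(intptr_t v)
{
    return (void *)(v >> 1);
}
```

Both encodings produce a word with the low bit set, so the GC treats it as an immediate integer. The first needs an aligned pointer; the second needs the top bits of the address to agree, which is typically the case for user-space addresses on 64-bit systems.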

Cheers,
Nicolas

hyphenrf · May 24, 2023, 10:42pm

Thank you so much, the condition for registration is now perfectly clear to me! And I understand why the manual just errs on the safe side instead of being more specific about such detail.

And yeah, if I’m not mistaken, malloc always returns pointers aligned to alignof(max_align_t), so I probably don’t even need that assertion!
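A quick way to check this on a given platform is sketched below, assuming a C11 compiler. The helper name is hypothetical. The C standard only promises alignment suitable for any object with a fundamental alignment requirement, so treat this as a sanity check rather than a proof.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Return 1 if p is aligned to the strictest fundamental alignment. */
static int is_max_aligned(const void *p)
{
    return (uintptr_t)p % _Alignof(max_align_t) == 0;
}
```

For a pointer returned by malloc(100), is_max_aligned is expected to return 1, and any such pointer also has its lowest bit clear, which is what the assertion in Alloc tests.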
