How to avoid write barrier slowdown


#1

Hi folks,

I am writing some numerical computation code in OCaml that operates over arrays that are exposed globally. However, I am finding that what I believe to be a write barrier causing a significant slowdown.

A minimal example illustrating the issue:

open Core
open Core_bench.Std

let inner_b n () =
  let v : int ref = ref 0 in
  for i = 0 to n do v := !v + !v done

let v : int ref = ref 0
let outer_b n () =
  for i = 0 to n do v := !v + !v done

let main () =
  let n = 100 in
  Command.run (Bench.make_command [
      Bench.Test.create ~name:"inner test" (inner_b n);
      Bench.Test.create ~name:"outer test" (outer_b n)
    ])

let _ = main ()

We can see that timings are greatly affected:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Name       β”‚ Time/Run β”‚ Percentage β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ inner test β”‚  56.05ns β”‚     18.83% β”‚
β”‚ outer test β”‚ 297.69ns β”‚    100.00% β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

I could imagine a few workwarounds to this issue, but given that I am not expert on the GC I thought I would ask for advice first.


#2

Not sure how much this helps you, but from what I can tell, the performance gap is due to v being optimized into a stack-allocated variable in the first case:

$ ocamlc -dlambda t.ml
(setglobal T!
  (let
    (inner_b/1002 =
       (function n/1003 param/1202
         (let (v/1004 =v[int] 0)
           (for i/1201 0 to n/1003 (assign v/1004 (+ v/1004 v/1004)))))
     v/1203 = (makemutable 0 (int) 0)
     outer_b/1204 =
       (function n/1205 param/1207
         (for i/1206 0 to n/1205
           (setfield_imm 0 v/1203 (+ (field 0 v/1203) (field 0 v/1203))))))
    (makeblock 0 inner_b/1002 v/1203 outer_b/1204)))

#3

Indeed, I was able to observe that, so maybe I should rephrase my question to β€œhow can I have fast global heap allocated buffers” ?

I thought of using Bigarray / writing my own C routine, but then I was afraid of the cost of an extern call. My computing scenario here is mainly very small β€œkernels” [fused or not] performing updates to global buffers or arrays.

I could also pass my buffers around, but that’d complicate things.


#4

The compiler doesn’t generate any calls to caml_modify from your code, in other words there are no write barriers. The write barrier occurs only if the right-hand side is a heap value, since you’re using immediates (ints) that can’t contain any reference, there is nothing to check here. The following code will invoke the write barrier:

type box = Box of int
let v : box ref = ref (Box 0)
let outer_c n () =
  for i = 0 to n do
    let {contents=Box x} = v in
    v := Box (x + x);
  done

The reason why the inner_b is faster, is because it doesn’t perform any effect that is visible outside. In fact it is more like a no-op, so that compiler should erase this code at all. Instead it compiles it just to a loop that increments the loop counter (and do not update the v reference at all), roughly it corresponds to for (int i = 0; i <= n; i++);, this is the x86-64 assembly generated by 4.05 without flambda:

camlExample__inner_b_1204:
	.cfi_startproc
.L102:
	movq	$1, %rbx
	cmpq	%rax, %rbx
	jg	.L100
.L101:
	movq	%rbx, %rdi
	addq	$2, %rbx
	cmpq	%rax, %rdi
	jne	.L101
.L100:
	movq	$1, %rax
	ret

As you may see there are no allocations, just a cycle, that updates the loop variable (+2 on each cycle, due to the OCaml int representation). Concerning the outer_b function it is also compiled to a very efficient code

camlExample__outer_b_1210:
.L105:
	movq	$1, %rbx
	cmpq	%rax, %rbx
	jg	.L103
.L104:
	movq	camlExample@GOTPCREL(%rip), %rdi
	movq	32(%rdi), %rdi
	movq	(%rdi), %rsi
	leaq	-1(%rsi,%rsi), %rsi
	movq	%rsi, (%rdi)
	movq	%rbx, %rdi
	addq	$2, %rbx
	cmpq	%rax, %rdi
	jne	.L104
.L103:
	movq	$1, %rax
	ret

It also doesn’t have any allocations nor, of course, any write brarriers, the different is the difference between the code that does nothing and the code that does something.

Concerning the general advice on how to get away from the write barriers. The advice number one (after of course the advice not to use mutable data) would to try to represent your modifiable data as immediate (using interning, for example). You can also try to minimize the number of updates, by gathering the right hand side into one big data structure.

P.S. one of the sources of inefficiency in the outer_b is that you global variable has the external linkage, i.e., it is also shared by other compilation units, providing an empty mli file or wrapping your code in module M : sig end = struct <your-code> end will enable much more efficient code.


#5

Oh thanks @ivg, indeed I’ve simplied the code too much… and even if I did look to the assembly I didn’t pay too much attention; sorry for the dummy example :frowning: Thanks for the advice, let me try to redo the example so I get my original case where the write barrier appeared.


#6

Yep, this is the common problem with microbenchmarking, you always end up measuring something irrelevant :wink:

You can also try to explain your problem in high level terms. From what you’ve already said, I can easily conclude that the write barrier shouldn’t be the cause of the slowdown, as it shall not arise in the numeric code. Especially, if you choose the right representation, e.g., type complex = {re : float; im : float} and vector = complex array is indeed a bad representation. Not only it will involve write barriers on every update, but it will also be quite memory inefficient. (Hint, the right representation is a bigarray with the Complex64 kind, or two bigarrays of the Float64 kind, as the latter will save you from allocating a new record on each access to the array element).


#7

Thanks @ivg ; the original examples used float array and we had some hope for unboxing; but after your round of feedback it is clear we want to try a few more things including letting OCaml know that the buffers are local to the module.