[ANN] reed-solomon-erasure 1.0.1

Overview

reed-solomon-erasure is an OCaml implementation of Reed-Solomon erasure coding.

It is a port of the Rust library reed-solomon-erasure, which is a port of several other libraries.

It provides functions for encoding, verifying, and reconstructing data and parity shards, and deals with string, bytes, and Core_kernel.Bigstring.t

You can install it via opam install reed-solomon-erasure.

Example

open Reed_solomon_erasure

let () =
  let r = ReedSolomon.make 3 2 in (* 3 data shards, 2 parity shards *)

  let master_copy = [|"\000\001\002\003";
                      "\004\005\006\007";
                      "\008\009\010\011";
                      "\000\000\000\000"; (* last 2 rows are parity shards *)
                      "\000\000\000\000"|] in

  (* Construct the parity shards *)
  ReedSolomon.encode_str r master_copy;

  (* Make a copy and transform it into option shards arrangement
     for feeding into reconstruct_opt_str *)
  let shards = RS_Shard_utils.shards_to_option_shards_str master_copy in

  (* We can remove up to 2 shards, which may be data or parity shards *)
  shards.(0) <- None;
  shards.(4) <- None;

  (* Try to reconstruct missing shards *)
  ReedSolomon.reconstruct_opt_str r shards;

  (* Convert back to normal shard arrangement *)
  let result = RS_Shard_utils.option_shards_to_shards_str shards in

  assert (ReedSolomon.verify_str r result);
  assert (master_copy = result)

Performance

The encoding performance is shown below

Machine : laptop with Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz (max 2.70GHz) 2 Cores 4 Threads

Configuration Klaus Post’s reed-solomon-erasure (Rust) ocaml-reed-solomon-erasure (bigstr) … (bytes) … (str)
10x2x1M ~7800MB/s ~4500MB/s ~3000MB/s ~1300MB/s ~1300MB/s

Links :

EDIT: Fixed benchmark data

7 Likes

This is quite nice, thank you! I would be interested in a bit more information on the performance profile and where you think your per-byte budget is being spent. According to the rust project README, the previous version of the Rust library, which did not use SIMD intrisics, performed at 1100MB/s on a similar machine.
Your code does use SIMD intrisics (the heavy lifting seems reused from Nicolas Trangez’ implementation, also used in the Rust library), but for example, is it possible to measure the performance without those SIMD intrisics, to compare to the other pre-SIMD implementations?

(I’m also curious at whether it is possible to measure the cost of the switch from OCaml to C and back – in your case using ctypes – separately from the cost of the rest of the OCaml codebase, for example.)

A direct comparison to the Rust version would be slightly unfair, since the Rust version uses threading very heavily(via rayon, a data parallelism library specifically), so all CPU cores are used instead of just one core. So it wasn’t just purely due to Rust being more low-level - I played a lot of tricks too.
A more fair comparison would require setting the maximum thread count to 1, though I am not sure how much it would matter ultimately.

I had a fully functional version at some point that uses bytes as basis, and performed at roughly 57MB/s on the same machine(which made me switch to using Bigstring + C FFI, since 57MB/s is absolutely a bottleneck for almost everything).

I did not add conditional compilation stuff to the current version, but the code can be easily tweaked to not use any SIMD intrinsics code. Basically just change the following code in src/galois.ml (starting at line 93)

let mul_slice (c : char) (input : bigstring) (out : bigstring) =
  let low  = Bigarray.Array2.slice_left mul_table_low  (c |> int_of_char) in
  let high = Bigarray.Array2.slice_left mul_table_high (c |> int_of_char) in

  assert (Bigstring.length input = Bigstring.length out);

  let size = Bigstring.length input in

  let bytes_done =
    (Unsigned.Size_t.to_int
       (reedsolomon_gal_mul
          (coerce (ptr char) (ptr uint8_t) (bigarray_start array1 low))
          (coerce (ptr char) (ptr uint8_t) (bigarray_start array1 high))
          (coerce (ptr char) (ptr uint8_t) (bigarray_start array1 input))
          (coerce (ptr char) (ptr uint8_t) (bigarray_start array1 out))
          (Unsigned.Size_t.of_int size)))
  in

  mul_slice_pure_ocaml ~skip_to:bytes_done c input out

let mul_slice_xor (c : char) (input : bigstring) (out : bigstring) =
  let low  = Bigarray.Array2.slice_left mul_table_low  (c |> int_of_char) in
  let high = Bigarray.Array2.slice_left mul_table_high (c |> int_of_char) in

  assert (Bigstring.length input = Bigstring.length out);

  let size = Bigstring.length input in

  let bytes_done =
    (Unsigned.Size_t.to_int
       (reedsolomon_gal_mul_xor
          (coerce (ptr char) (ptr uint8_t) (bigarray_start array1 low))
          (coerce (ptr char) (ptr uint8_t) (bigarray_start array1 high))
          (coerce (ptr char) (ptr uint8_t) (bigarray_start array1 input))
          (coerce (ptr char) (ptr uint8_t) (bigarray_start array1 out))
          (Unsigned.Size_t.of_int size)))
  in

  mul_slice_xor_pure_ocaml ~skip_to:bytes_done c input out

to just

let mul_slice (c : char) (input : bigstring) (out : bigstring) =
  mul_slice_pure_ocaml c input out

let mul_slice_xor (c : char) (input : bigstring) (out : bigstring) =
  mul_slice_xor_pure_ocaml c input out

and do make benchmark.

The results on my laptop(the machine for all benchmarks shown) is as follows

bigstr bytes str
~94MB/s ~90MB/s ~88MB/s

So roughly 11x slower compared to pure Rust pre-SIMD version.

I have no idea how to time that alone specifically, but the cost is basically as follows.
For the bigstr version, the only cost of switching is the coerce of the pointers basically(in the above code block) then whatever cost is attached to calling to C FFIs.
For bytes and str versions, they basically have the cost of the coerce, C FFI costs, as well as copying twice - once from bytes/string to Bigstring.t then from Bigstring.t to bytes/string.

(Forgot to mention in above reply, I made some mistakes in benchmarking and have updated the data in the post and on the gitlab repo, Rust github repo is still waiting on PR. Basically I used MB=10^6 bytes while I should have done MB=2^20 bytes.)

Overall the Galois field code does a lot of table lookups, this is kinda okay-ish in terms of speed in Rust(as per the pre-SIMD benchmark data), but table lookup might be too costly in OCaml(not sure), and SIMD intrinsics speed up the lookup process very significantly(by bytes shuffling intrinsics IIRC, you can find the details in the paper “Screaming Fast Galois Field Arithmetic”.

Actually now that I’m thinking about it, I recall I did have some profiling code around, but not setup properly for this version, I can get to that at some point, probably next day.

Apologies for the late reply, I was preoccupied by other stuff.

Here’s the profiler output as gist

The code is at the pure-ocaml-bench branch here

The first few lines of the flat profile


Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 74.00     10.93    10.93                             camlReed_solomon_erasure__Galois__mul_slice_xor_pure_ocaml_inner_5708
 14.22     13.03     2.10 1900021475     0.00     0.00  camlPervasives__char_of_int_1125
  4.60     13.71     0.68      345     1.97     1.97  camlPervasives__$5e_1117
  4.13     14.32     0.61                             camlReed_solomon_erasure__Galois__mul_slice_pure_ocaml_inner_5701
  1.49     14.54     0.22 12582912     0.00     0.00  camlRandom__intaux_1238
0.61 14.63 0.09 12582913 0.00 0.00 camlRandom__bits_1233

Code of mul_slice_xor_pure_ocaml

let mul_slice_xor_pure_ocaml
    ?(skip_to : int = 0)
    (c        : char)
    (input    : bigstring)
    (out      : bigstring)
  : unit =
  let mt = Bigarray.Array2.slice_left mul_table (c |> int_of_char) in

  assert (Bigstring.length input = Bigstring.length out);

  let len = Bigstring.length input in

  for n = skip_to to (len) - 1 do
    out.{n} <- ((mt.{input.{n} |> int_of_char} |> int_of_char) lxor (out.{n} |> int_of_char)) |> char_of_int;
  done

Benchmark

It sits at ~79MB/s for encode_bigstr(this benchmark is very noisy due to me also using the computer, but should be in the same order of magnitude) (EDIT: Actually nvm, just look at above reply for the benchmarks without noise)

encode_bigstr :
    shards           : 10 / 2
    shard length     : 1048576
    time taken       : 12.636033
    byte count       : 1048576000
    MB/s             : 79.138762

It seems that char_of_int is quite heavy in this context, but mul_slice_xor_pure_ocaml itself still uses a lot of time.

I’ll be doing some tuning and testing to see if the performance of the pure OCaml version can be increased. Also appreciate feedback on whether I’ve made any mistakes.

EDIT : Fixed benchmark data

Turns out I made the stupidest mistake - not using unsafe in the pure code. Not sure how I managed to overlook this.

Doing so doubled the performance, see below

bigstr bytes str
196 MB/s 180 MB/s 183 MB/s

Switching to using string as tables instead of Bigstring for pure OCaml side seems to add a slight speed boost(not tested with sufficient rounds for correct conclusion though)

bigstr bytes str
203 MB/s 189 MB/s 186 MB/s

Overall, unsafe pure OCaml is 5x slower than unsafe pure Rust.

1 Like

Thanks! This is quite interesting. On the other hand, I’m asking questions mostly out of idle curiosity, so please do not put too much effort in replying if you are not yourself interested in pursuing this.

Reading your further elements, two questions come to mind:

  • For the pure-ocaml benchmark, have you tried with a flambda switch, which uses a different compiler middle-end which optimizes more aggressively? I would be curious to see numbers both with and without flambda. (The easiest way to do this is to have two separate compiler switches, given that flambda is a configure-time option.) The fact that int_of_char sits so high in the profile suggests that more aggressive inlining etc. could maybe help.

  • For the not-pure-ocaml benchmark, I would still be interested in figuring out where the speed difference between OCaml and other languages is coming from. In theory, the hard work is done in the C routine, so any throughput difference comes from the FFI cost – but given that the C routine works on vectors of value, if I understand correctly, the FFI cost should be amortized and mostly negligible. Where are the OCaml-specific cost
    coming from?

The pure-ocaml benchmarks are interesting to look at because they stress-test the compiler middle-end and backend (the optimizations, and the quality of the code generation). On the other hand, the non-pure-ocaml code is more representative of the kind of workloads number-crunching users may run (so in a sense it is more important in practice), and I don’t have a good mental picture of what causes the FFI costs (additional memory copying? the C-to-OCaml calling convention? GC costs?).

I haven’t looked closely at the numbers or the code, but here are some quick thoughts:

It seems that char_of_int is quite heavy in this context

It’d be interesting to see how much difference switching to Char.unsafe_chr (which skips the range check, and is consequently a no-op) makes.

In theory, the hard work is done in the C routine, so any throughput difference comes from the FFI cost

If the FFI cost does turn out to be significant then you should see a significant improvement if you switch from the Foreign module (which performs dynamic calls) to ctypes stub generation.

Yep, certainly. I am in my holiday and I am interested in pursuing this as well.

There were cases where the Rust library was forced to build in pure Rust mode for some applications due to not working on some platforms(I think it was macOS, due to different C compiler being used, different GCC options etc), so having a pure OCaml version that is of reasonable performance is handy should similar situation arise.

@gasche W.r.t. flambda, is it correct that it is enabled once I’ve made the switch via opam? Or do I need to add compilation flags to compiler?
I did opam switch 4.06.1+flambda but it runs around 2x slower than just 4.06.0 and 4.06.1 for pure-ocaml version.

For not-pure-ocaml version, I made some adjustments to the Rust benchmark suite to make sure it is strictly single threaded, and the difference is roughly 200MB/s. Rust runs at ~3200MB/s, OCaml (bigstr) runs at ~3000MB/s.

And yep, C routine does the heavy lifting, and in this benchmark suite specifically, the FFI cost only adds onto the total cost once every 1MB(one C routine call via FFI per MB). There are some slight differences in code between OCaml and Rust in that Rust has slices and OCaml doesn’t, so the additional allocation might contribute to the latency. But this should be very minimal(adds onto the total cost once every few MBs), so I’ll need to test more to know whether it’s FFI(by trying what @yallop suggested) or the code difference.

During the rewrite, I switched to using Bigstring’s unsafe_set and unsafe_get functions, so the call is no longer necessary. The Bigstring’s unsafe_set and unsafe_get functions subsequently became the heavier functions in profiler’s output, but I don’t think there’s much I can do. (EDIT : Actually that might just be due to noise, I am not very sure on that yet)

Thanks for the info! I’ll give this a shot later.

1 Like

You can check whether flambda is enabled for a compiler with ocamlopt -config | grep ^flambda. Installing the switch and setting your environment to use it eval $(opam config env ...) should be enough, and indeed it looks like (1) you are using the flambda switch correctly and (2) flambda makes this particular code slower instead of faster. Thanks!

It may be possible to write a concurrent version of the OCaml program by using the Thread module to create several worker threads. The current OCaml runtime has a Big Interpreter/Runtime/Kernel lock, but calls to the C FFI don’t need to hold the lock, see the manual: parallel execution of long-running C code for how to release it in a binding (I’m not sure how to do it using ctypes, @yallop would know). Whether that ends up being a performance win depends on the relative cost of context-switching and the sequential work done by each C subroutine.

Yes: for Foreign bindings you can pass ~release_runtime_lock:true to foreign. For generated stubs pass ~concurrency:unlocked to the stub generation functions.

1 Like

Ah, gotcha. I got very confused as I thought flambda should have made it faster, but I suppose the optimization strategy did not work out too well for this code segment.

So I tried several things, and below is the summary

Performance tuning

  • Array splitting does not affect performance, as expected, but thought I’d try.
  • Passing ~release_runtime_lock:true to foreign doesn’t affect performance.
  • Loop unrolling does not affect performance, as expected, but thought I’d try. Seemed to have helped a bit in Rust, but not really sure.

I am guessing replacing current code with stub generation is the only way to find out where the cost with not-pure-ocaml version is coming from. This does not impact my anticipated usage of the library too much(3000MB/s is already way above the bandwidth of many medium), so I might only investigate much later in future.

Updated/corrected comparison to pure Rust

I also reran the Rust benchmark with the single threaded, pure Rust setup. The previous comparison was done with multi-threaded pure Rust setup, so “5x slower” was not entirely accurate.

The performance of said setup sits at ~600MB/s, so pure OCaml version is only 3x slower than pure Rust single-threaded version, which is pretty good.

200MB/s seems to be the best I can do for pure OCaml version, so I’ll leave it at that as I’m happy with it. The very original intention of porting this library from Rust to OCaml was to implement some sort of distributed file/data storage with redundancy using MirageOS, and 200MB/s means it will not be a bottle neck for networked traffic even if running in pure OCaml mode in a gigabit network(which is around 119MB/s).

To potential users of any matrix multiplication based Reed-Solomon libraries in any language

This section is just for clarification(which I should have done earlier I think) and has nothing to do with performance.

The above figures are measured with the data=10, parity=2 configuration. This is chosen due to being reasonably fast, and reasonably resistant to corruption(you can lose 2 blocks out of every 12 blocks in same block set). A reasonably common/sane setup so to speak.

However, it is important to know that the performance degrades drastically(time complexity is O(n)^2 [1]) as you increase the data shard count and/or the parity shard count. Thus the figures above(3000MB/s for OCaml+C, 200MB/s for pure OCaml) are only useful for comparison to other libraries, and are not descriptive of the throughput in your application if your configuration is not data=10, parity=2.

Do test your specific configuration to understand the ramification of the library on your application throughput. Alternatively, consider libraries that use different principles such as FastECC which give better time complexity.

References
[1] GitHub - Bulat-Ziganshin/FastECC: Reed-Solomon coder computing one million parity blocks at 1 GB/s. O(N*log(N)) algo employing FFT.

1 Like

You’ve probably seen this, but flambda has several optimization levels which can be controlled by a compiler option similar to gcc, -O1 to -O3 (and -Oclassic). You want to try -O3 to see if optimization can do anything. Default is -O1.

Thanks for the info! I realised I overlooked optimisation levels entirely.

One reason that might have caused me to not test different optimisation levels (and forgetting about it entirely later) is that then I also need to try all optimisation levels in Rust.

I will get my hands on that when I’m a bit more free as school just started.

Hopefully the overall results and comparison will be interesting, since AFAIK there aren’t that many somewhat non-trivial projects that get both an OCaml version and Rust version written by the same author with essentially the same design.