Working with huge data chunks

Usually I used the Bigarray.Array1.* types and functions for handling the data, via Bigarray.array1_of_genarray (Unix.map_file ...). But at some point I started to meet bigger and bigger files, e.g. 10 GB and larger, where such access became unacceptably slow. At the same time I would prefer not to change every single place that accesses the data as a bigarray - or, if all accesses do need to change, to change as few as possible. What are possible solutions?
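For reference, here is roughly how that mapping is set up - a minimal sketch, assuming a read-only char bigarray; map_whole_file is just a placeholder name:

(* Sketch: mmap a whole file read-only into a 1-D char bigarray.
   [| -1 |] asks map_file to use the full file size. *)
let map_whole_file path =
  let fd = Unix.openfile path [ Unix.O_RDONLY ] 0 in
  let ga = Unix.map_file fd Bigarray.char Bigarray.c_layout false [| -1 |] in
  Unix.close fd;
  Bigarray.array1_of_genarray ga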

The answer really depends on why the use of mmap is now slow. Are you accessing the data sequentially or randomly? Do you just want to stream the data and write it out as soon as possible, or do you need more structured access patterns?

Once you figure out what the best underlying mechanism to access the files from the operating system is, you can then come up with a suitable abstraction depending on what that is.

The data is structured, but access can be both sequential and random at the same time: think of scanning the data chunk for some patterns first (they can be sparse), then parsing it in blocks. There is no need to do that in real time and stream ASAP; it's fine to perform some computations first, before writing out the result.

That sounds like a good case for mmap, since it’ll only page in the sparse data you access. Have you profiled it to find out why it’s getting slower with large files? You might just be running out of CPU.

I did a few perf measurements; it seems the problem is slow reads out of the map - the top functions are the slicing ones. Imagine something like:

(* we mapped the file to `ba` here *)

let rec bytesweep ~off acc =
  let byte0 = Bigarray.Array1.get ba off in
  let byte1 = Bigarray.Array1.get ba (off + 1) in
  let results = match byte0 with
    | '\x00' -> (* some checks *)
    | '\x01' -> (* other checks, more reads, etc. *)
    (* more matches here, the match is BIG *)
  in
  bytesweep ~off:(off + 1) (List.append acc results)
in
bytesweep ~off:0 []

I tried both the tail-recursive version and the usual one; it doesn't change much. The problem is probably the random access into these big chunks of the mmapped file. I'm thinking about making a "window" by copying big blocks, e.g. 4 MB, from the mmapped region into RAM and sweeping through them - roughly as in the sketch below.
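Something like this (a sketch; window_size, sweep_windows and f are made-up names, and a char bigarray is assumed):

let window_size = 4 * 1024 * 1024

(* Copy the mapped data into a preallocated RAM buffer in 4MB blocks
   and hand each in-RAM window to the scanning function f. One sub per
   block is cheap compared to one per byte. *)
let sweep_windows ba f =
  let buf = Bigarray.Array1.create Bigarray.char Bigarray.c_layout window_size in
  let len = Bigarray.Array1.dim ba in
  let off = ref 0 in
  while !off < len do
    let n = min window_size (len - !off) in
    Bigarray.Array1.blit (Bigarray.Array1.sub ba !off n) (Bigarray.Array1.sub buf 0 n);
    f ~off:!off buf n;   (* scan the first n bytes of the window *)
    off := !off + n
  done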

What do you mean by slicing? Do you create views (subarrays)? Also, did you try madvise?

Yes, depending on the values of these bytes, I might need to peek at some more bytes with Bigarray.Array1.sub.

Creating a view, e.g. via reshape or sub, is usually slow, because a bigarray is a custom value and has to be created in the major heap, so every creation of a bigarray can trigger a minor collection. We addressed those issues by preparing all the views that we need in advance (or by keeping a pool of views). The rule of thumb: do not create bigarrays in a loop.
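To illustrate the rule with a toy sketch (hypothetical names):

(* Bad: a fresh bigarray (custom value) is allocated on every iteration. *)
let scan_bad ba n =
  for i = 0 to n - 1 do
    ignore (Bigarray.Array1.sub ba i 2)
  done

(* Better: prepare the single view we need in advance, outside the loop. *)
let scan_good ba n =
  let window = Bigarray.Array1.sub ba 0 (n + 1) in
  for i = 0 to n - 1 do
    ignore (Bigarray.Array1.get window i)
  done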

Though, I’m not sure that this is the root of your problem, because creating a view doesn’t depend on the size of the data (it is just a pointer). It looks like the bottleneck is the Linux paging mechanism. If this is true, then a clever use of madvise might resolve your issue. The first thing I would try is MADV_HUGEPAGE. Then I would ensure that the access pattern is set properly, e.g., MADV_SEQUENTIAL. IIRC somewhere in JS Core_extended I’ve seen bindings to madvise.
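If you want to try it from OCaml, here is a sketch, assuming the Lwt_bytes module from lwt.unix (its advice type covers MADV_SEQUENTIAL and friends; as far as I know it has no MADV_HUGEPAGE constructor):

(* Lwt_bytes.t is a plain char bigarray, so a mapped file can be
   advised directly over its whole extent. *)
let advise_sequential (ba : Lwt_bytes.t) =
  Lwt_bytes.madvise ba 0 (Bigarray.Array1.dim ba) Lwt_bytes.MADV_SEQUENTIAL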


Thanks, that is very good advice! I will try it and see if it improves the situation.

btw, if you really need to create a subview, you can build your own substring type on top of a bigarray, which basically has the following type definition:

type view = {
   off : int;
   len : int;
   data : Bigstring.t;
}

Then the view function will just change off and len without touching data, so ba_sub is never called.
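For instance, a minimal sketch of such a view function over the type above:

(* Narrow the window: only off and len change, data is shared, and no
   new bigarray is allocated. *)
let view ?(off = 0) ?len v =
  let len = match len with
    | Some len -> len
    | None -> v.len - off
  in
  { v with off = v.off + off; len }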

Surprisingly, this type already exists in BAP; it is called Memory.t :stuck_out_tongue: And Memory.view is efficient, as it doesn’t do any slicing.


List.append may not be the most efficient here; have you tried collecting your results in another way? (e.g. List.rev_append followed by a final List.rev, or storing the results straight into another array)
Also, I assume you have something that terminates your recursion and that your offsets are always within bounds. If that is the case, you could try using unsafe_get instead of get, and if you see any improvements, abstract this away into a higher-order function, e.g. yourbigarray -> start:int -> stop:int -> f:(int -> int -> 'a list) -> 'a list, that performs bounds checking once.
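Something along these lines - a sketch with a made-up name (fold_bytes), where f receives the offset and the element itself rather than two ints:

(* Check the bounds once up front, then use unsafe_get in the loop and
   accumulate with rev_append, reversing once at the end. *)
let fold_bytes ba ~start ~stop ~f =
  if start < 0 || start > stop || stop > Bigarray.Array1.dim ba then
    invalid_arg "fold_bytes";
  let rec go off acc =
    if off >= stop then List.rev acc
    else
      let results = f off (Bigarray.Array1.unsafe_get ba off) in
      go (off + 1) (List.rev_append results acc)
  in
  go start []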


Close to what cstruct provides :slight_smile: : https://github.com/mirage/ocaml-cstruct/blob/master/lib/cstruct.ml#L34


Thanks for the answers, I will try a bit of everything tomorrow.

Just for the record - there is no madvise in Core_extended, only in Lwt_unix.


Thanks everyone! I applied every single method suggested here to improve performance, along with better algorithms to skip the uninteresting data - works like a charm!
