Working with huge data chunks

Usually I used the Bigarray.Array1.* types and functions for handling the data, via Bigarray.array1_of_genarray (Unix.map_file ...). But at some point I started to meet bigger and bigger files, e.g. 10 GB and larger, where such access became unacceptably slow. At the same time I would prefer not to change every single place that accesses the data as a bigarray - or, if all accesses do need to change, to change as few as possible. What are possible solutions?
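For reference, here is roughly how that mapping is set up - a minimal sketch, assuming a read-only char bigarray; map_whole_file is just a placeholder name:

(* Sketch: mmap a whole file read-only into a 1-D char bigarray.
   [| -1 |] asks map_file to use the full file size. *)
let map_whole_file path =
  let fd = Unix.openfile path [ Unix.O_RDONLY ] 0 in
  let ga = Unix.map_file fd Bigarray.char Bigarray.c_layout false [| -1 |] in
  Unix.close fd;
  Bigarray.array1_of_genarray ga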

The answer really depends on why the use of mmap is now slow. Are you accessing the data sequentially or randomly? Do you just want to stream the data and write it out as soon as possible, or do you need more structured access patterns?

Once you figure out what the best underlying mechanism to access the files from the operating system is, you can then come up with a suitable abstraction depending on what that is.

The data is structured, but access can be both sequential and random at the same time: think of scanning the data chunk for some patterns first (they can be sparse), then parsing it in blocks. There is no need to do that in real time and stream ASAP; it's fine to perform some computations first, before writing out the result.

That sounds like a good case for mmap, since it’ll only page in the sparse data you access. Have you profiled it to find out why it’s getting slower with large files? You might just be running out of CPU.

I did a few perf measurements; it seems the problem is slow reads out of the map - the top functions are the slicing ones. Imagine something like:

(* we mapped the file to `ba` here *)

let rec bytesweep ~off acc =
  let byte0 = Bigarray.Array1.get ba off in
  let byte1 = Bigarray.Array1.get ba (off + 1) in
  let results = match byte0 with
    | '\x00' -> (* some checks *)
    | '\x01' -> (* other checks, more reads, etc. *)
    (* more matches here, the match is BIG *)
  in
  bytesweep ~off:(off + 1) (List.append acc results)
in
bytesweep ~off:0 []

I tried both the tail-recursive version and the usual one; it doesn't change much. The problem is probably the random access into these big chunks of the mmapped file. I'm thinking about making a "window" by copying big blocks, e.g. 4 MB, from the mmapped region into RAM and sweeping through them - roughly as in the sketch below.
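Something like this (a sketch; window_size, sweep_windows and f are made-up names, and a char bigarray is assumed):

let window_size = 4 * 1024 * 1024

(* Copy the mapped data into a preallocated RAM buffer in 4MB blocks
   and hand each in-RAM window to the scanning function f. One sub per
   block is cheap compared to one per byte. *)
let sweep_windows ba f =
  let buf = Bigarray.Array1.create Bigarray.char Bigarray.c_layout window_size in
  let len = Bigarray.Array1.dim ba in
  let off = ref 0 in
  while !off < len do
    let n = min window_size (len - !off) in
    Bigarray.Array1.blit (Bigarray.Array1.sub ba !off n) (Bigarray.Array1.sub buf 0 n);
    f ~off:!off buf n;   (* scan the first n bytes of the window *)
    off := !off + n
  done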

What do you mean by slicing? Do you create views (subarrays)? Also, did you try madvise?

Yes, depending on the values of these bytes, I might need to peek at some more bytes with Bigarray.Array1.sub.

Creating a view, e.g. via reshape or sub, is usually slow, because a bigarray is a custom value and has to be created in the major heap, so every creation of a bigarray can trigger a minor collection. We addressed those issues by preparing all the views that we need in advance (or by keeping a pool of views). The rule of thumb: do not create bigarrays in a loop.
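To illustrate the rule with a toy sketch (hypothetical names):

(* Bad: a fresh bigarray (custom value) is allocated on every iteration. *)
let scan_bad ba n =
  for i = 0 to n - 1 do
    ignore (Bigarray.Array1.sub ba i 2)
  done

(* Better: prepare the single view we need in advance, outside the loop. *)
let scan_good ba n =
  let window = Bigarray.Array1.sub ba 0 (n + 1) in
  for i = 0 to n - 1 do
    ignore (Bigarray.Array1.get window i)
  done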

Though, I’m not sure that this is the root of your problem, because creating a view doesn’t depend on the size of the data (it is just a pointer). It looks like the bottleneck is the Linux paging mechanism. If this is true, then a clever use of madvise might resolve your issue. The first thing I would try is MADV_HUGEPAGE. Then I would ensure that the access pattern is set properly, e.g., MADV_SEQUENTIAL. IIRC somewhere in JS Core_extended I’ve seen bindings to madvise.
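If you want to try it from OCaml, here is a sketch, assuming the Lwt_bytes module from lwt.unix (its advice type covers MADV_SEQUENTIAL and friends; as far as I know it has no MADV_HUGEPAGE constructor):

(* Lwt_bytes.t is a plain char bigarray, so a mapped file can be
   advised directly over its whole extent. *)
let advise_sequential (ba : Lwt_bytes.t) =
  Lwt_bytes.madvise ba 0 (Bigarray.Array1.dim ba) Lwt_bytes.MADV_SEQUENTIAL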


Thanks, that is very good advice! I will try it and see if it improves the situation.

btw, if you really need to create a subview, you can build your own substring type on top of a bigarray, which basically has the following type definition:

type view = {
   off : int;
   len : int;
   data : Bigstring.t;
}

Then the view function will just change off and len without touching data, so ba_sub is never called.
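For instance, a minimal sketch of such a view function over the type above:

(* Narrow the window: only off and len change, data is shared, and no
   new bigarray is allocated. *)
let view ?(off = 0) ?len v =
  let len = match len with
    | Some len -> len
    | None -> v.len - off
  in
  { v with off = v.off + off; len }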

Surprisingly, this type already exists in BAP; it is called Memory.t :stuck_out_tongue: And Memory.view is efficient, as it doesn’t do any slicing.


List.append may not be the most efficient here; have you tried collecting your results in another way? (e.g. List.rev_append followed by a final List.rev, or storing the results straight into another array)
Also, I assume you have something that terminates your recursion and that your offsets are always within bounds. If that is the case, you could try using unsafe_get instead of get, and if you see any improvements, abstract this away into a higher-order function, e.g. yourbigarray -> start:int -> stop:int -> f:(int -> int -> 'a list) -> 'a list, that performs bounds checking once.
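Something along these lines - a sketch with a made-up name (fold_bytes), where f receives the offset and the element itself rather than two ints:

(* Check the bounds once up front, then use unsafe_get in the loop and
   accumulate with rev_append, reversing once at the end. *)
let fold_bytes ba ~start ~stop ~f =
  if start < 0 || start > stop || stop > Bigarray.Array1.dim ba then
    invalid_arg "fold_bytes";
  let rec go off acc =
    if off >= stop then List.rev acc
    else
      let results = f off (Bigarray.Array1.unsafe_get ba off) in
      go (off + 1) (List.rev_append results acc)
  in
  go start []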


Close to what cstruct provides :slight_smile: : https://github.com/mirage/ocaml-cstruct/blob/master/lib/cstruct.ml#L34


Thanks for the answers, I will try a bit of everything tomorrow.

Just for the record - there is no madvise in Core_extended, only in Lwt_unix.


Thanks everyone! I applied every single method suggested here to improve performance, along with better algorithms to skip the uninteresting data - works like a charm!
