I started learning OCaml a few weeks ago and I have made this little project. It’s a kind of “dataframe” library exposing a data structure that stores heterogeneous collections of data organized in series (or columns) of the same type. More serious implementations of this idea are the pandas and polars libraries. In the OCaml world there is a library (ocaml-dataframe) which implements this too (I didn’t know about it before I started this project).
I store a data series (a.k.a. column) in two forms (dataframe.ml l112): as an array, or as a sequence derived from another series (itself in array or sequence form):
type t =
  | Source of Ftype.t array
  | Derived of Ftype.t Seq.t
I wanted to know how this structure performs, along with the function derive (dataframe.ml l174), which builds a new series from another. So I made a little benchmark (benchmark/serie.ml).
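(For context, derive is roughly a lazy map over the source; a simplified sketch of its shape, the full version being in dataframe.ml:)

(* Simplified shape of derive: wrap a lazy Seq view over the source,
   so no new array is allocated until the derived series is consumed. *)
let derive (f : Ftype.t -> Ftype.t) (s : t) : t =
  match s with
  | Source a -> Derived (Seq.map f (Array.to_seq a))
  | Derived sq -> Derived (Seq.map f sq)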
The time performance looks acceptable, but I have noticed a huge memory consumption when I run this program: roughly 17GB for a single array of 100,000,000 elements and sequences built on this array. In my mind the additional cost of sequences built on this array should be minimal. Is that really the case, and what accounts for this consumption?
I’ve read that data stored through ad hoc types (as I did with Ftype.t in dataframe.ml l4) has an additional cost since the values are boxed. Could this explain the extra consumption?
In a previous implementation I stored data in arrays in their unboxed form and had a type for columns, but this was complicating my code quite a bit, so I changed to the current implementation.
The main difficulty with this project is that I’d like it to be possible to operate on two columns of different types, while maintaining relative type safety and keeping good performance (time and space) at the same time.
How would you do it? I’d be glad to read your advice.
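(To give an idea, the unboxed version looked roughly like this GADT sketch; a simplified reconstruction, the names are made up:)

(* Sketch: one unboxed array per element type, with a GADT parameter
   tracking the element type at the column level. *)
type _ column =
  | Int_col : int array -> int column
  | Float_col : float array -> float column (* float array is unboxed in OCaml *)
  | String_col : string array -> string column

(* Operating on a column stays type-safe: *)
let map_int (f : int -> int) (c : int column) : int column =
  match c with
  | Int_col a -> Int_col (Array.map f a)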
My guess is that your program spends most of its time converting between sequences and other things. Try switching Serie.t to just Ftype.t array Lazy.t, removing all occurrences of Seq, and see what you get.
In general using Seq is not a bad idea, but its performance profile is subtle and it’s easy to mess things up.
Also, the IntSet.t part of the dataframe can be replaced by a simple int, as you only need to store the number of elements (a set would make sense if you had support for missing indices, but even then you could probably find better representations).
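Concretely, the idea is something like this (just a sketch of the representation, not a drop-in patch):

(* Sketch: a series is a lazily computed array; derive composes thunks
   and only materializes one array when the series is forced. *)
type t = Ftype.t array Lazy.t

let derive (f : Ftype.t -> Ftype.t) (s : t) : t =
  lazy (Array.map f (Lazy.force s))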
I have just tested the benchmark without deriving data, like this:
let integer_serie_stress_test () =
  let f = fun () ->
    Seq.repeat 100
    |> Seq.take it
    |> Dataframe.Serie.of_int_seq
    (* |> Dataframe.Serie.derive
         (Dataframe.Ftype.from_int_to_int (( * ) 50)) *)
  in
  time f
And it appears it is the array that consumes the memory, I guess because of the boxed values. Let me know if I compute this properly.
I have 100,000,000 integers, so I guess they each need 8 bytes of storage plus an 8-byte pointer. Let’s say numerics need twice as much, as there are two integers in their structure, plus a pointer, so roughly 8 * 3 = 24 bytes. Each string uses 2 bytes, as they represent an integer from 1 to 100, so at most (pointer included) 10 bytes.
This gives us: 1,600,000,000 + 2,400,000,000 + 1,000,000,000 = 5GB.
I am surely wrong in this approximation, but we are far from the 20GB the program actually consumes.
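One way to check such estimates (a sketch; Int stands in for my Ftype.t constructor) is Obj.reachable_words, which counts the heap words reachable from a value:

(* Measure the actual heap footprint of an array of boxed ints.
   On 64-bit, one word = 8 bytes; expect ~3 words (24 bytes) per
   element: the array slot plus a 2-word block (header + payload). *)
type elt = Int of int

let () =
  let n = 1_000_000 in
  let a = Array.init n (fun i -> Int i) in
  let words = Obj.reachable_words (Obj.repr a) in
  Printf.printf "%d bytes total, %.1f bytes per element\n"
    (8 * words) (8.0 *. float_of_int words /. float_of_int n)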
type elt = Int of int (* stand-in for the boxed Ftype.t constructor *)

let n = 100_000_000

let f () =
  (* Takes ~ 5 GB *)
  Seq.repeat 100 |> Seq.take n |> Seq.map (fun i -> Int i) |> Array.of_seq

let g () =
  (* Takes ~ 2.3 GB *)
  Array.init n (fun i -> Int i)

let h () =
  (* Also takes ~ 2.3 GB *)
  let a = Array.make n (Int 0) in
  Seq.repeat 100 |> Seq.take n |> Seq.iteri (fun i x -> a.(i) <- Int x);
  a
f is roughly what your code is doing. g is a straightforward allocation + initialization of the array. Note that for g I use Array.init and allocate a new Int i for each cell. I end up using 2.3GB, which is correct: 100_000_000 elements, each needing 1 pointer in the array plus a block of size 2 (the header and the single integer), i.e. 3 words = 24 bytes per element. h gives the same behavior as g. For f, looking at the code of Array.of_seq: since one cannot know the length of the seq in advance, the seq is first turned into a list, then the list length is computed, then the array is allocated and filled with the contents of the list.
So even if you do a Gc.full_major () afterwards, the cells of the list do get reclaimed, but since the last allocated thing is the array, I don’t think the memory can be given back to the OS? (wild guess). If I then exit the function and call Gc.full_major () again, then this time memory is given back to the OS and I see a modest memory consumption of a few megabytes (down from ~5GB).
If you do something like f in a tight loop, maybe this accumulates? (All of these are wild guesses.)
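As a rough cross-check of the ~5 GB for f, assuming the list-first behaviour above: at the peak, each element is live as a 3-word list cell (24 bytes) plus its 2-word Int block (16 bytes) plus an 8-byte slot in the final array, i.e. 48 bytes x 100_000_000 ≈ 4.8 GB.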
Edit: I had pasted the wrong code for f, which was not boxing the integers.
Indeed, creating the series from a sequence consumes more memory. This is weird; I don’t understand exactly why, but it’s a fact. I have made a function to build directly from an Ftype.t array and saved 8GB.
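(The direct construction is essentially just wrapping the array; a simplified sketch:)

(* Build a series directly from an array: no Seq, no intermediate list. *)
let of_array (a : Ftype.t array) : t = Source a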
Here are the results, from the original benchmark and then from the direct version:
$ command time -f "Total memory consumption: %MKb" -- dune exec benchmark/serie.exe
100_000_000 integer serie stress test
Execution time: 34.471662s
===========================================================
100_000_000 numeric serie stress test
Execution time: 38.916575s
===========================================================
100_000_000 string serie stress test
Execution time: 39.948984s
===========================================================
Total execution time: 41.931293s
Total memory consumption: 18783844Kb
$ command time -f "Total memory consumption: %MKb" -- dune exec benchmark/serie_direct.exe
100_000_000 integer serie stress test
Execution time: 14.339211s
===========================================================
100_000_000 numeric serie stress test
Execution time: 15.817362s
===========================================================
100_000_000 string serie stress test
Execution time: 22.598401s
===========================================================
Total execution time: 23.363822s
Total memory consumption: 10956288Kb
What’s even stranger is that using the derive function has virtually no cost, even though it’s also a sequence. If I don’t use it I have roughly the same consumption:
$ command time -f "Total memory consumption: %MKb" -- dune exec benchmark/serie_no_deriv.exe
100_000_000 integer serie stress test
Execution time: 14.708946s
===========================================================
100_000_000 numeric serie stress test
Execution time: 15.763159s
===========================================================
100_000_000 string serie stress test
Execution time: 22.949356s
===========================================================
Total execution time: 24.808437s
Total memory consumption: 10140900Kb
The consumption is still big (surely because of the boxed items in the array), but better, and the execution time is greatly improved.
For f, looking at the code of Array.of_seq: since one cannot know the length of the seq in advance, the seq is first turned into a list, then the list length is computed, then the array is allocated and filled with the contents of the list.
This makes sense, good catch. I’m going to try running a garbage collection before the test in the original benchmark.
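Something like this before each run (a sketch; Gc.compact performs a full major collection and then compacts the heap):

(* Force a full major collection and compact the heap before timing,
   so garbage left over from setup does not inflate the measurement. *)
let () = Gc.compact ()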
Interesting: I have changed the way I build the array from the sequence and I got results closer to those obtained by creating the array directly (with a slight additional cost, as I need to initialize the array).
$ command time -f "Total memory consumption: %MKb" -- dune exec benchmark/serie.exe
100_000_000 integer serie stress test
Execution time: 25.965392s
===========================================================
100_000_000 numeric serie stress test
Execution time: 28.949169s
===========================================================
100_000_000 string serie stress test
Execution time: 28.716412s
===========================================================
Total execution time: 31.256429s
Total memory consumption: 11745096Kb
So Array.of_seq looks suboptimal for my needs, I guess for the reason you pointed out. I think the consumption that remains high is due to the boxed values. I think the best would be to change Serie so that it stores native types, with a type parameter directly on Serie to track the type of data it represents.
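(The changed construction is essentially a two-pass build along these lines; a sketch that traverses the sequence once for the length and once more for the elements:)

(* Two-pass build: count first, then fill. Avoids the intermediate list
   that Array.of_seq allocates, but traverses the sequence twice. *)
let array_of_seq (s : Ftype.t Seq.t) : Ftype.t array =
  match s () with
  | Seq.Nil -> [||]
  | Seq.Cons (x, _) ->
    let n = Seq.length s in
    let a = Array.make n x in
    Seq.iteri (fun i v -> a.(i) <- v) s;
    a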
Be careful that there is a huge difference between doing Array.of_seq on one side, and Seq.length followed by Seq.iteri on the other side. Indeed, the latter evaluates the sequence twice. This means that, if the sequence is ephemeral (e.g., directly read from a file), then the array will not contain anything at the end, because all the elements will already have been consumed by Seq.length.
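A single-pass alternative (a sketch, with a hypothetical name) is to pass the expected length explicitly, so each element is consumed exactly once and ephemeral sequences stay safe:

(* Single-pass build with an explicit length: every element of the
   sequence is consumed exactly once, so ephemeral sequences work too. *)
let array_of_seq_n (n : int) (s : Ftype.t Seq.t) : Ftype.t array =
  if n = 0 then [||]
  else
    match s () with
    | Seq.Nil -> invalid_arg "array_of_seq_n"
    | Seq.Cons (x, rest) ->
      let a = Array.make n x in
      Seq.iteri (fun i v -> a.(i + 1) <- v) (Seq.take (n - 1) rest);
      a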