How to dump many floats in binary format so that OCaml can read them in later

I need to write many floats to disk from a Python script (although it could also be a C program).
Later, I would like to read them back as 32-bit floats in OCaml.
What format should I use?
I want to use 32-bit floats, because 64-bit floats would take twice as much space.
I think 32 bits of precision is more than enough for what I am doing.

I guess some people at Jane Street might know that. :wink:

You can use Int32.float_of_bits to read a single-precision IEEE 754 float into OCaml.
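A minimal sketch of that conversion, using only Int32.float_of_bits and Int32.bits_of_float from the standard library. The helper names (f32_of_bits, bits_of_f32, round_to_f32) are made up for illustration. Keep in mind that an OCaml float is always a double, so going through 32 bits rounds the value to single precision.

```ocaml
(* Interpret a 32-bit pattern as an IEEE 754 single-precision float.
   The result is an ordinary OCaml float (a double) holding that value. *)
let f32_of_bits (bits : int32) : float = Int32.float_of_bits bits

(* The reverse direction: round the float to single precision and
   return its 32-bit pattern. *)
let bits_of_f32 (x : float) : int32 = Int32.bits_of_float x

(* Round-tripping through 32 bits shows how much precision is lost. *)
let round_to_f32 (x : float) : float = f32_of_bits (bits_of_f32 x)
```

For instance, 1.5 is exactly representable in single precision and survives round_to_f32 unchanged, while 0.1 does not.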

Cheers,
Nicolas


This makes me think of mmap, Bigarray, and Cstruct.


Nice, so for integers, I should be using Stdlib.output_binary_int and input_binary_int.
And for 32-bit floats, there is the extra step of Int32.float_of_bits / bits_of_float.
I.e. we can read/write int32 and float32 values from/to disk.
I will benchmark whether this is faster than using Marshal.

For 64-bit floats, I see Int64.float_of_bits, but I don’t know if there is a function to write those 64 bits
to disk in the first place. So I guess I will have to combine bit shifting with Stdlib.output_binary_int and input_binary_int.
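For what it is worth, here is a sketch that avoids manual bit shifting by going through an 8-byte buffer. It assumes OCaml 4.08 or later (for Bytes.set_int64_le and Bytes.get_int64_le) and a little-endian file layout. The function names output_float64_le and input_float64_le are hypothetical, not part of the standard library.

```ocaml
(* Write one float as 8 bytes, little-endian IEEE 754 double. *)
let output_float64_le (oc : out_channel) (x : float) : unit =
  let buf = Bytes.create 8 in
  Bytes.set_int64_le buf 0 (Int64.bits_of_float x);
  output_bytes oc buf

(* Read 8 bytes and reinterpret them as a little-endian double.
   Raises End_of_file if fewer than 8 bytes remain. *)
let input_float64_le (ic : in_channel) : float =
  let buf = Bytes.create 8 in
  really_input ic buf 0 8;
  Int64.float_of_bits (Bytes.get_int64_le buf 0)
```

The channels should be opened with open_out_bin and open_in_bin so that no newline translation happens on platforms that do it.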


Maybe something like CBOR would work for you? It’s a standardized binary serialization format that has a dedicated datatype for 32-bit floats, and there are implementations for OCaml and Python (and many other languages).

Shameless plug: I am the author of a CBOR implementation for OCaml: opam - cborl

No. output_binary_int and input_binary_int operate on the low 32 bits of values of type int, meaning that on 32-bit platforms, where int is only 31 bits wide, you’ll lose some bits.

For reliable encoding/decoding of 32 and 64-bit integers, please use Bytes.{get,set}_int{32,64}_{le,be,ne}, which also let you control the endianness you want to use.
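As a rough sketch of how these combine with Int32.bits_of_float / Int32.float_of_bits for a whole file of 32-bit floats: the code below assumes OCaml 4.08 or later and a headerless file of consecutive little-endian float32 values, which is the layout that, for example, NumPy's arr.astype('<f4').tofile(path) produces. All function names here are invented for illustration.

```ocaml
(* Encode an array of floats as consecutive little-endian float32 values.
   Each value is rounded to single precision on the way. *)
let bytes_of_float32s (xs : float array) : Bytes.t =
  let buf = Bytes.create (4 * Array.length xs) in
  Array.iteri
    (fun i x -> Bytes.set_int32_le buf (4 * i) (Int32.bits_of_float x))
    xs;
  buf

(* Decode a buffer of little-endian float32 values; trailing bytes that
   do not form a complete 4-byte value are ignored. *)
let float32s_of_bytes (buf : Bytes.t) : float array =
  Array.init (Bytes.length buf / 4)
    (fun i -> Int32.float_of_bits (Bytes.get_int32_le buf (4 * i)))

(* Write the whole array to [path] as raw float32 data, no header. *)
let write_float32s (path : string) (xs : float array) : unit =
  let oc = open_out_bin path in
  Fun.protect ~finally:(fun () -> close_out oc)
    (fun () -> output_bytes oc (bytes_of_float32s xs))

(* Read the whole file at [path] back into a float array. *)
let read_float32s (path : string) : float array =
  let ic = open_in_bin path in
  Fun.protect ~finally:(fun () -> close_in ic)
    (fun () ->
      let n = in_channel_length ic in
      let buf = Bytes.create n in
      really_input ic buf 0 n;
      float32s_of_bytes buf)
```

Fixing the byte order explicitly (the _le variants here) rather than using _ne means the file stays readable if writer and reader ever run on machines with different endianness.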


I think you should try hard to use an existing format, perhaps even an existing library. By the way, there seems to be an OCaml library, npy, for reading NumPy data.


This might be too heavy-handed for you, but HDF5 is a decent choice for serializing and loading back large numerical datasets, especially if your data is shaped like a typical dataframe.

It has a proven track record, as it is widely used by the scientific computing community and the finance industry. They like the fact that it’s high-performance, standardized, and supports hierarchies and thus multiple datasets within one file. You can also memory-map it, or use filters and chunking to avoid loading the entire file.

The biggest negative: the only implementation is a complex C library, which every other language, including OCaml, has to wrap.


OK, in the end I did everything in OCaml; I store Int32 and Float32 values in two Bigarrays.
My hand-written (de)serializer generates smaller files (75% of the size of the Marshal ones).
However, Marshal is faster at reading and writing (plus I don’t need to maintain more code).
So, I’ll stick with Bigarrays, 32-bit numbers and Marshal.
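For readers who want to try the same combination, a minimal sketch follows. It is not the actual code from my program: the type alias f32_array and the functions make_f32, save and load are hypothetical names. It assumes OCaml 4.08 or later, where Bigarray is part of the standard library and Fun.protect is available.

```ocaml
(* A one-dimensional Bigarray whose elements are stored as 32-bit floats. *)
type f32_array =
  (float, Bigarray.float32_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Copy an ordinary float array into float32 storage (values get rounded). *)
let make_f32 (xs : float array) : f32_array =
  Bigarray.Array1.of_array Bigarray.float32 Bigarray.c_layout xs

(* Serialize the Bigarray with Marshal. *)
let save (path : string) (a : f32_array) : unit =
  let oc = open_out_bin path in
  Fun.protect ~finally:(fun () -> close_out oc)
    (fun () -> Marshal.to_channel oc a [])

(* Marshal is not type-safe: the annotation only asserts the type, so the
   file must really have been produced by [save]. *)
let load (path : string) : f32_array =
  let ic = open_in_bin path in
  Fun.protect ~finally:(fun () -> close_in ic)
    (fun () -> (Marshal.from_channel ic : f32_array))
```

Note that the Marshal format is specific to OCaml, so this only works when both the writer and the reader are OCaml programs.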

Maybe I could shave a few seconds by using Unix.map_file, but since my program is already
quite fast, I will not bother (also, it is Friday evening and I am a bit lazy/tired…). :grin:

Thanks for all the input.
