How to dump many floats in binary format so that OCaml can read them in later

I need to write many floats to disk from a Python script (it could also be a C program).
Later, I would like to read them back as 32-bit floats in OCaml.
What format should I use?
I want 32-bit floats because 64-bit floats would take twice as much space.
I expect 32-bit precision to be more than enough for what I am doing.


I guess some people at Jane Street might know that. :wink:

You can use Int32.float_of_bits to read a single-precision IEEE 754 float into OCaml.

Cheers,
Nicolas
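To illustrate the conversion mentioned above, here is a minimal sketch of what Int32.bits_of_float and Int32.float_of_bits do to an OCaml float (which is a 64-bit double). The helper name round_to_float32 is made up for this example.

```ocaml
(* Hypothetical helper: round a 64-bit OCaml float to the nearest
   IEEE 754 single-precision value by going through its 32-bit pattern. *)
let round_to_float32 (x : float) : float =
  Int32.float_of_bits (Int32.bits_of_float x)

(* The bit pattern of 1.0 in single precision is 0x3F800000. *)
let one_bits : int32 = Int32.bits_of_float 1.0

(* 1.5 is exactly representable in 32 bits, 0.1 is not, so only the
   first value survives the round trip unchanged. *)
let exact = round_to_float32 1.5
let rounded = round_to_float32 0.1
```

Values that are not exactly representable in single precision come back slightly changed, which is the precision loss the original question accepts.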


Somehow I think of mmap, Bigarray, Cstruct.


Nice, so for integers, I should be using: Stdlib.output_binary_int and input_binary_int.
And for 32b floats, the extra step of: Int32.float_of_bits / bits_of_float.
I.e. we can read/write int32 and float32 from/to disk.
I will benchmark if this is faster than using Marshal.

For 64b floats, I see Int64.float_of_bits, but I don’t know if there is a function to write those 64 bits
to disk first. So I guess I will have to combine bit shifting with Stdlib.output_binary_int and input_binary_int.


Maybe something like CBOR would work for you? It’s a standardized binary serialization format that has a dedicated datatype for 32bit floats and there are implementations for OCaml and Python (and many other languages).

Shameless plug: I am the author of a CBOR implementation for OCaml: opam - cborl

No. Stdlib.output_binary_int and input_binary_int operate on the low 32 bits of values of type int, meaning that on 32-bit platforms, where int is 31 bits wide, you’ll lose some bits.

For reliable encoding/decoding of 32 and 64-bit integers, please use Bytes.{get,set}_int{32,64}_{le,be,ne}, which also let you control the endianness you want to use.
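As a rough sketch of that advice, the functions below write and read an int32 and a 64-bit float through a small Bytes buffer with explicit little-endian order. The function names are invented for this example; only the Bytes, Int64 and channel functions come from the standard library (Bytes.set_int32_le and friends need OCaml 4.08 or later).

```ocaml
(* Hypothetical helpers: fixed-width, little-endian encoding on channels. *)
let write_int32_le (oc : out_channel) (n : int32) : unit =
  let b = Bytes.create 4 in
  Bytes.set_int32_le b 0 n;
  output_bytes oc b

let read_int32_le (ic : in_channel) : int32 =
  let b = Bytes.create 4 in
  really_input ic b 0 4;            (* raises End_of_file on a short read *)
  Bytes.get_int32_le b 0

(* A 64-bit float is written as the 8 bytes of its IEEE 754 bit pattern. *)
let write_float64_le (oc : out_channel) (x : float) : unit =
  let b = Bytes.create 8 in
  Bytes.set_int64_le b 0 (Int64.bits_of_float x);
  output_bytes oc b

let read_float64_le (ic : in_channel) : float =
  let b = Bytes.create 8 in
  really_input ic b 0 8;
  Int64.float_of_bits (Bytes.get_int64_le b 0)
```

This also covers the 64-bit float case asked about earlier: no bit shifting is needed, since Int64.bits_of_float and Bytes.set_int64_le handle the whole value. The channels should be opened with open_out_bin and open_in_bin so that no newline translation happens.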


I think you should try hard to use an existing format, perhaps even an existing library. There seems to be npy to read numpy data, by the way.


This might be too heavy handed for you but hdf5 is a decent choice for serializing and loading back up large numerical datasets, especially if your data is shaped like a typical dataframe.

It has a proven track record as it is often used by the scientific computing community and the finance industry as well. They like the fact that it’s high performance, standardized, and supports hierarchies and thus multiple datasets within one file. You can also memory map to it or use filters and chunking to avoid loading the entire file.

The biggest negative: there is only one implementation, a complex C library, which inevitably gets wrapped for other languages, including OCaml.


OK, in the end I did everything in OCaml; I store Int32 and Float32 values in two Bigarrays.
My hand-written (de)serializer generates smaller files (75% of the size of the Marshal ones).
However, Marshal is faster at reading and writing (plus I don’t need to maintain more code).
So, I’ll stick with Bigarrays, 32-bit numbers and Marshal.
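For readers who want to try the same combination, here is a minimal sketch of marshalling a float32 Bigarray to a file and reading it back. The type alias and function names are made up for this example, and this is not necessarily how the program above is organised.

```ocaml
(* Hypothetical alias for a one-dimensional bigarray of 32-bit floats. *)
type f32_array =
  (float, Bigarray.float32_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Marshal the whole bigarray to a file in one call. *)
let save_marshal (path : string) (a : f32_array) : unit =
  let oc = open_out_bin path in
  Marshal.to_channel oc a [];
  close_out oc

(* Unmarshal it again. Marshal is not type safe, so the annotation is
   a promise that the file really contains an f32_array. *)
let load_marshal (path : string) : f32_array =
  let ic = open_in_bin path in
  let a = (Marshal.from_channel ic : f32_array) in
  close_in ic;
  a
```

The resulting file uses OCaml's own marshalling format, so it is only convenient when both the writer and the reader are OCaml programs.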

Maybe I could shave a few seconds by using Unix.map_file, but since my program is already
quite fast, I will not bother (also, it is Friday evening and I am a bit lazy/tired…). :grin:

Thanks for all the inputs.


The easiest way with the standard library in my opinion is

let bytes_of_float32 (x : float) : Bytes.t =
  let b = Bytes.create 4 in
  (* _ne is native endianness; use _be for big endian or _le for little endian *)
  Bytes.set_int32_ne b 0 (Int32.bits_of_float x);
  b

You can then do whatever you want with your bytes
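One thing to do with those bytes is to dump a whole array to a file. The sketch below, with invented function names, writes a float array as consecutive little-endian 32-bit floats and reads such a file back. Little endian is an assumption made here so that the layout does not depend on the machine. On the Python side, NumPy's `arr.astype('<f4').tofile(path)` produces this same raw layout and `numpy.fromfile(path, dtype='<f4')` reads it.

```ocaml
(* Hypothetical helper: write xs as raw little-endian 32-bit floats. *)
let write_floats32_le (path : string) (xs : float array) : unit =
  let buf = Bytes.create (4 * Array.length xs) in
  Array.iteri
    (fun i x -> Bytes.set_int32_le buf (4 * i) (Int32.bits_of_float x))
    xs;
  let oc = open_out_bin path in
  output_bytes oc buf;
  close_out oc

(* Hypothetical helper: read the whole file back as a float array.
   Trailing bytes that do not form a full 4-byte value are ignored. *)
let read_floats32_le (path : string) : float array =
  let ic = open_in_bin path in
  let len = in_channel_length ic in
  let buf = Bytes.create len in
  really_input ic buf 0 len;
  close_in ic;
  Array.init (len / 4)
    (fun i -> Int32.float_of_bits (Bytes.get_int32_le buf (4 * i)))
```

The file has no header, so the element count is simply the file size divided by 4; any shape or metadata has to be stored elsewhere.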

A long while back I added IO.write_float32 and the matching read function to extlib to do just that. A tremendous contribution: 2 SLOC :yum:


Yes! It is still in batteries: BatIO.{read_float|write_float}.
I never use BatIO, but I should have had a look at their code.

I dug out their code from this file https://github.com/ocaml-batteries-team/batteries-included/blob/d471e24712dd1c0adb90db6894c1c721078b3934/src/batIO.ml:

let read_real_i32 ch =
  let big = Int32.shift_left (Int32.of_int (read_byte ch)) 24 in
  let ch3 = read_byte ch in
  let ch2 = read_byte ch in
  let ch1 = read_byte ch in
  let base = Int32.of_int (ch1 lor (ch2 lsl 8) lor (ch3 lsl 16)) in
  Int32.logor base big

let read_float ch =
  Int32.float_of_bits (read_real_i32 ch)

let write_real_i32 ch n =
  let base = Int32.to_int n in
  let big = Int32.to_int (Int32.shift_right_logical n 24) in
  write_byte ch big;
  write_byte ch (base lsr 16);
  write_byte ch (base lsr 8);
  write_byte ch base

let write_float ch f =
  write_real_i32 ch (Int32.bits_of_float f)

Answer to self:
apparently, Unix.map_file (which maps a file into memory as a Bigarray) is several orders of magnitude faster
than unmarshalling a float array from a file.
The write operation (using a shared mmapped Bigarray) is slightly faster than marshalling a float array to a file.
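A minimal sketch of the Unix.map_file approach, assuming the file holds raw 32-bit floats in the machine's native byte order and that the program is linked with the unix library. The function name and its optional parameters are invented for this example.

```ocaml
(* Hypothetical helper: map a file of raw native-endian float32 values
   as a one-dimensional bigarray.
   ~shared:true opens the file read-write (creating it if needed) and
   makes stores to the bigarray go to the file.
   ~size is the number of elements; -1 means "infer it from the file size". *)
let map_float32 ?(shared = false) ?(size = -1) (path : string) =
  let flags =
    if shared then [ Unix.O_RDWR; Unix.O_CREAT ] else [ Unix.O_RDONLY ]
  in
  let fd = Unix.openfile path flags 0o644 in
  let ga =
    Unix.map_file fd Bigarray.float32 Bigarray.c_layout shared [| size |]
  in
  (* The mapping stays valid after the descriptor is closed. *)
  Unix.close fd;
  Bigarray.array1_of_genarray ga
```

Reading is then just `let a = map_float32 "data.f32" in a.{0}` (the file name is a placeholder). For writing, a call such as `map_float32 ~shared:true ~size:n path` gives an array whose stores end up in the file. Nothing is copied up front, because pages are loaded on demand, which is consistent with the speed-up reported above.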
