How to dump many floats in binary format so that OCaml can read them in later

UnixJunkie · June 8, 2022, 1:15pm

I need to write many floats to disk from a Python script (though it could also be a C program).
Later, I would like to read them back as 32-bit floats in OCaml.
What format should I use?
I want 32-bit floats because 64-bit floats would be twice as much data.
I guess 32 bits of precision is more than enough for what I am doing.

UnixJunkie · June 8, 2022, 1:20pm

I guess some people at Jane Street might know that.

nojb · June 8, 2022, 1:57pm

You can use Int32.float_of_bits to read a single-precision IEEE 754 float into OCaml.
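
As a minimal sketch of that idea, assuming the producer wrote raw little-endian IEEE 754 singles back to back (for example what Python's struct.pack('<f', x) emits), the file can be decoded four bytes at a time. The helper names read_float32_le and read_all_float32_le are made up for this example, and the file size is assumed to be a multiple of 4.

```ocaml
(* Read one little-endian float32 from a binary input channel.
   Hypothetical helper; raises End_of_file if fewer than 4 bytes remain. *)
let read_float32_le ic =
  let b = Bytes.create 4 in
  really_input ic b 0 4;
  Int32.float_of_bits (Bytes.get_int32_le b 0)

(* Read every float32 stored in the file at [path].
   Assumes the file holds nothing but float32 values. *)
let read_all_float32_le path =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () ->
      let n = in_channel_length ic / 4 in
      let a = Array.make n 0.0 in
      for i = 0 to n - 1 do
        a.(i) <- read_float32_le ic
      done;
      a)
```

The values come back as ordinary 64-bit OCaml floats; only the on-disk representation is 32 bits wide.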

Cheers,
Nicolas

mro · June 8, 2022, 8:42pm

somehow I think of mmap, bigarray, cstruct.

UnixJunkie · September 26, 2022, 8:01am

Nice, so for integers, I should be using: Stdlib.output_binary_int and input_binary_int.
And for 32b floats, the extra step of: Int32.float_of_bits / bits_of_float.
I.e. we can read/write int32 and float32 from/to disk.
I will benchmark if this is faster than using Marshal.

For 64b floats, I see Int64.float_of_bits, but I don’t know if there is a function to write those 64 bits
to disk first. So I guess I will have to combine bit shifting with Stdlib.output_binary_int and input_binary_int.

pukkamustard · September 27, 2022, 8:15am

Maybe something like CBOR would work for you? It’s a standardized binary serialization format that has a dedicated datatype for 32bit floats and there are implementations for OCaml and Python (and many other languages).

Shameless plug: I am the author of a CBOR implementation for OCaml: opam - cborl

xavierleroy · September 27, 2022, 5:07pm

No. These functions (output_binary_int and input_binary_int) operate on the low 32 bits of values of type int, meaning that on 32-bit platforms you’ll lose some bits.

For reliable encoding/decoding of 32 and 64-bit integers, please use Bytes.{get,set}_int{32,64}_{le,be,ne}, which also let you control the endianness you want to use.
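
A rough sketch of how these accessors might be wrapped for channel I/O, covering the 64-bit float case as well. The four function names are hypothetical, and little-endian is an arbitrary choice here; swap in the _be or _ne variants as needed.

```ocaml
(* Write an int32 as 4 little-endian bytes. Hypothetical helper. *)
let output_int32_le oc (n : int32) =
  let b = Bytes.create 4 in
  Bytes.set_int32_le b 0 n;
  output_bytes oc b

(* Read back an int32 written by [output_int32_le]. *)
let input_int32_le ic : int32 =
  let b = Bytes.create 4 in
  really_input ic b 0 4;
  Bytes.get_int32_le b 0

(* Write a float as an 8-byte little-endian IEEE 754 double. *)
let output_float64_le oc (f : float) =
  let b = Bytes.create 8 in
  Bytes.set_int64_le b 0 (Int64.bits_of_float f);
  output_bytes oc b

(* Read back a float written by [output_float64_le]. *)
let input_float64_le ic : float =
  let b = Bytes.create 8 in
  really_input ic b 0 8;
  Int64.float_of_bits (Bytes.get_int64_le b 0)
```

Because the values never pass through the native int type, the result is the same on 32-bit and 64-bit platforms.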

c-cube · September 27, 2022, 5:30pm

I think you should try hard to use an existing format, perhaps even library. There seems to be npy to read numpy data, by the way.

struktured · September 27, 2022, 6:53pm

This might be too heavy-handed for you, but HDF5 is a decent choice for serializing and loading back large numerical datasets, especially if your data is shaped like a typical dataframe.

It has a proven track record as it is often used by the scientific computing community and the finance industry as well. They like the fact that it’s high performance, standardized, and supports hierarchies and thus multiple datasets within one file. You can also memory map to it or use filters and chunking to avoid loading the entire file.

The biggest negative: there is only one implementation, a complex C library, which is inevitably wrapped for other languages, including OCaml.

UnixJunkie · September 30, 2022, 8:47am

Ok, so in the end I did everything in OCaml; I use Int32 and Float32 in two Bigarrays.
My hand-written (de)serializer generates smaller files (75% of Marshal ones).
However, Marshal is faster at reading and writing (plus I don’t need to maintain more code).
So, I’ll stick with bigarrays, 32bits numbers and Marshal.

Maybe I could shave a few seconds by using Unix.map_file, but since my program is already
quite fast, I will not bother (also, it is Friday evening and I am a bit lazy/tired…).

Thanks for all the inputs.

levillain.maxime · October 4, 2022, 1:24pm

The easiest way with the standard library in my opinion is

let b = Bytes.create 4 in
let i32 = Int32.bits_of_float <your_float_number> in
Bytes.set_int32_ne b 0 i32; (* or _be for big endian or _le for little endian *)

You can then do whatever you want with your bytes
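
Extending the same idea to a whole array, here is a sketch that packs every element into one buffer so it can be written with a single output_bytes call. The function names are made up for the example, and little-endian is assumed.

```ocaml
(* Pack a float array into consecutive little-endian float32 values.
   Each element is rounded to single precision by Int32.bits_of_float. *)
let bytes_of_float32s_le (a : float array) : Bytes.t =
  let b = Bytes.create (4 * Array.length a) in
  Array.iteri
    (fun i x -> Bytes.set_int32_le b (4 * i) (Int32.bits_of_float x))
    a;
  b

(* Inverse operation; assumes the buffer length is a multiple of 4. *)
let float32s_of_bytes_le (b : Bytes.t) : float array =
  Array.init
    (Bytes.length b / 4)
    (fun i -> Int32.float_of_bits (Bytes.get_int32_le b (4 * i)))
```

Values that are not exactly representable in single precision (0.1, for instance) come back slightly different from what was put in, which is the price of the smaller file.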

Philippe_Strauss · October 5, 2022, 12:38am

A long while back I added IO.write_float32 and the matching read function to extlib to do just that. A tremendous contribution: 2 SLOC.

UnixJunkie · October 5, 2022, 1:23am

Yes! It is still in batteries: BatIO.{read_float|write_float}.
I never use BatIO, but I should have had a look at their code.

UnixJunkie · October 5, 2022, 1:27am

I dug out their code from this file https://github.com/ocaml-batteries-team/batteries-included/blob/d471e24712dd1c0adb90db6894c1c721078b3934/src/batIO.ml:

let read_real_i32 ch =
  let big = Int32.shift_left (Int32.of_int (read_byte ch)) 24 in
  let ch3 = read_byte ch in
  let ch2 = read_byte ch in
  let ch1 = read_byte ch in
  let base = Int32.of_int (ch1 lor (ch2 lsl 8) lor (ch3 lsl 16)) in
  Int32.logor base big

let read_float ch =
  Int32.float_of_bits (read_real_i32 ch)

let write_real_i32 ch n =
  let base = Int32.to_int n in
  let big = Int32.to_int (Int32.shift_right_logical n 24) in
  write_byte ch big;
  write_byte ch (base lsr 16);
  write_byte ch (base lsr 8);
  write_byte ch base

let write_float ch f =
  write_real_i32 ch (Int32.bits_of_float f)

UnixJunkie · April 11, 2023, 4:33am

Answer to self:
apparently, Unix.map_file (retrieve bigarray from file) is several orders of magnitude faster
than unmarshaling a float array from a file.
The write operation (using a shared mmapped bigarray) is slightly faster than marshalling a float array to file.
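
For reference, a sketch of what the mapping can look like. It needs the unix library; map_float32 is a hypothetical wrapper, and the file is assumed to contain float32 values in the machine's native byte order.

```ocaml
(* Map the file at [path] as a one-dimensional float32 bigarray.
   [n] is the number of elements; pass -1 to infer it from the file size.
   With ~shared:true, writes to the array go back to the file, and the
   file is grown if it is shorter than 4 * n bytes. *)
let map_float32 ?(shared = false) path n =
  let fd = Unix.openfile path [ Unix.O_RDWR; Unix.O_CREAT ] 0o644 in
  Fun.protect
    ~finally:(fun () -> Unix.close fd) (* the mapping outlives the descriptor *)
    (fun () ->
      Bigarray.array1_of_genarray
        (Unix.map_file fd Bigarray.float32 Bigarray.c_layout shared [| n |]))
```

Reading a.{i} returns a regular OCaml float, and nothing is copied until the corresponding pages are actually touched, which would explain the speed difference with unmarshaling.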

zoj613 · May 15, 2024, 9:22am

Is there a way in 2024 to read and write floats that are 32bit or 16 bit? Maybe a library for floats like stdint? I am working on a project to read floats from a store and such floats can be 16bit, 32bit or 64bit. I also need to be able to write such floats back into the store. How can this be achieved?

dbuenzli · May 15, 2024, 9:39am

I suspect you can devise something with bigarrays (which support 16 and 32 bits floats) and the brand new (5.2) In_channel.input_bigarray and Out_channel.output_bigarray functions. Or Unix.write_bigarray and Unix.read_bigarray if you are working with fds.
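
A small sketch of the bigarray half of this suggestion only; the channel and fd functions are left out. It shows a float32 bigarray being filled from, and read back as, ordinary OCaml floats. Both function names are hypothetical.

```ocaml
(* Copy a float array into a bigarray that stores 4 bytes per element. *)
let float32_of_array (a : float array) =
  Bigarray.Array1.of_array Bigarray.float32 Bigarray.c_layout a

(* The single-precision value nearest to [x], widened back to a double.
   This is what a float32 bigarray hands back after storing [x]. *)
let round_to_float32 (x : float) : float =
  Int32.float_of_bits (Int32.bits_of_float x)
```

Element access such as a.{i} always yields a 64-bit float; the narrowing happens on store and the widening on load.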

zoj613 · May 15, 2024, 9:54am

Mmm, I am not sure the In/Out_channel functions would help. To be clear, I am trying to implement the chunked multidimensional array specification described here. The element data needs to be read from bytes stored in a “store” (a store here could be any key-value data structure such as a directory, a hashtable or an S3 bucket) and deserialized as the data type given in its metadata file. Updating an element requires writing data back to the store as bytes (by serializing the value using the correct data type).

My issue is that for floats, OCaml only supports 64-bit floats, so I would not be able to read element data from a store using the specified data type if it’s not a 64-bit float. I am looking to see if there is a way I can work around this. I was hoping there is a library I could use to easily work with floating-point values of varying precision.

dbuenzli · May 15, 2024, 10:04am

That’s true. But using bigarrays you can have in memory arrays of {16,32,64}-bit floats and interact with them using these 64bit floats. Not sure if that’s enough for you. I guess it is if you don’t expect to crunch numbers in OCaml itself but, for example, simply ship these floats to a GPU.
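
If the elements arrive as plain bytes rather than in a bigarray, the widening can also be done by hand. What follows is only a sketch: the dtype type and every function name are invented for the example, little-endian storage is assumed, and only decoding is shown (encoding to 16 bits needs proper rounding, which is not attempted here).

```ocaml
(* Element types a store might declare; hypothetical for this sketch. *)
type dtype = F16 | F32 | F64

(* Size in bytes of one element of the given type. *)
let size_of = function F16 -> 2 | F32 -> 4 | F64 -> 8

(* Widen an IEEE 754 binary16 bit pattern (0..0xFFFF) to an OCaml float.
   Layout: 1 sign bit, 5 exponent bits (bias 15), 10 fraction bits. *)
let float_of_half_bits h =
  let sign = if h land 0x8000 <> 0 then -1.0 else 1.0 in
  let exp = (h lsr 10) land 0x1f in
  let frac = h land 0x3ff in
  let magnitude =
    if exp = 0 then ldexp (float_of_int frac) (-24) (* zero or subnormal *)
    else if exp = 0x1f then (if frac = 0 then infinity else nan)
    else ldexp (float_of_int (frac lor 0x400)) (exp - 25) (* normal *)
  in
  sign *. magnitude

(* Decode the element stored at byte offset [off] of [b]. *)
let decode_le dtype b off =
  match dtype with
  | F16 -> float_of_half_bits (Bytes.get_uint16_le b off)
  | F32 -> Int32.float_of_bits (Bytes.get_int32_le b off)
  | F64 -> Int64.float_of_bits (Bytes.get_int64_le b off)
```

Every width ends up as a 64-bit float in memory, which loses nothing since each narrower format is exactly representable in a double.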

lukstafi · May 15, 2024, 10:50am

Even when using CPU, it is better to runtime-compile the specialized algorithm (via C or another portable assembly language), otherwise there’s unnecessary overhead.
