I need to write out to disk many floats from a Python script (but that could also be from a C program).
Later, I would like to read them as 32bit floats in OCaml.
What is the format I should use?
I want to use 32bit floats, because 64bit floats would be two times more data.
I guess 32 bits of precision is more than enough for what I am doing.
I guess some people at Jane Street might know that.
You can use Int32.float_of_bits to read a single-precision IEEE 754 float into OCaml.
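For illustration, here is a minimal sketch of that call, assuming the four bytes were written in little-endian order (the byte order and the helper name `float_of_le_bytes` are assumptions, not something stated above):

```ocaml
(* Decode one IEEE 754 single-precision value stored as 4 little-endian
   bytes. [float_of_le_bytes] is a hypothetical helper name. *)
let float_of_le_bytes (b : bytes) (pos : int) : float =
  Int32.float_of_bits (Bytes.get_int32_le b pos)

let () =
  (* 0x3FC00000 is the single-precision bit pattern of 1.5 *)
  let b = Bytes.of_string "\x00\x00\xc0\x3f" in
  Printf.printf "%g\n" (float_of_le_bytes b 0)
```

The result is an ordinary 64-bit OCaml float holding the exact value of the 32-bit input.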
Cheers,
Nicolas
Somehow this makes me think of mmap, bigarray, cstruct.
Nice, so for integers, I should be using: Stdlib.output_binary_int and input_binary_int.
And for 32b floats, the extra step of: Int32.float_of_bits / bits_of_float.
I.e. we can read/write int32 and float32 from/to disk.
I will benchmark if this is faster than using Marshal.
For 64b floats, I see Int64.float_of_bits, but I don’t know if there is a function to write those 64 bits
to disk first. So I guess I will have to combine bit shifting with Stdlib.output_binary_int and input_binary_int.
Maybe something like CBOR would work for you? It’s a standardized binary serialization format that has a dedicated datatype for 32bit floats and there are implementations for OCaml and Python (and many other languages).
Shameless plug: I am the author of a CBOR implementation for OCaml: opam - cborl
No. These functions operate on the low 32 bits of values of type int, meaning that on 32-bit platforms you’ll lose some bits.
For reliable encoding/decoding of 32 and 64-bit integers, please use Bytes.{get,set}_int{32,64}_{le,be,ne}, which also let you control the endianness you want to use.
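A small sketch of how these functions might be used, including the 64-bit float case asked about earlier (go through Int64.bits_of_float, then write the resulting int64). The helper names are hypothetical and little-endian is an arbitrary choice here:

```ocaml
(* Fixed-width encoders with an explicit little-endian byte order.
   Helper names are hypothetical; the Bytes/Int64 calls are stdlib. *)
let bytes_of_int32 (n : int32) : bytes =
  let b = Bytes.create 4 in
  Bytes.set_int32_le b 0 n;
  b

let bytes_of_float64 (x : float) : bytes =
  let b = Bytes.create 8 in
  Bytes.set_int64_le b 0 (Int64.bits_of_float x);
  b

let float64_of_bytes (b : bytes) : float =
  Int64.float_of_bits (Bytes.get_int64_le b 0)
```

The resulting bytes can be written with output_bytes and read back with really_input, so no manual bit shifting is needed.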
I think you should try hard to use an existing format, perhaps even library. There seems to be npy to read numpy data, by the way.
This might be too heavy handed for you but hdf5 is a decent choice for serializing and loading back up large numerical datasets, especially if your data is shaped like a typical dataframe.
It has a proven track record as it is often used by the scientific computing community and the finance industry as well. They like the fact that it’s high performance, standardized, and supports hierarchies and thus multiple datasets within one file. You can also memory map to it or use filters and chunking to avoid loading the entire file.
The biggest negative: the only implementation is a complex C library, so every other language, including OCaml, has to go through bindings to it.
Ok, so in the end I ended up doing everything in OCaml; I use Int32 and Float32 in two Bigarrays.
My hand-written (de)serializer generates smaller files (75% of Marshal ones).
However, Marshal is faster at reading and writing (plus I don’t need to maintain more code).
So, I’ll stick with bigarrays, 32bits numbers and Marshal.
Maybe I could shave a few seconds by using Unix.map_file, but since my program is already
quite fast, I will not bother (also, it is Friday evening and I am a bit lazy/tired…).
Thanks for all the inputs.
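For reference, a minimal sketch of the "bigarray plus Marshal" combination described in this post. The names `f32_array`, `save` and `load` are hypothetical, and the sketch assumes the file is only ever read back by OCaml, since the Marshal format is OCaml-specific and not type-checked on load:

```ocaml
(* A one-dimensional bigarray of 32-bit floats. *)
type f32_array =
  (float, Bigarray.float32_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Marshal the bigarray to a file. *)
let save (path : string) (a : f32_array) : unit =
  let oc = open_out_bin path in
  Fun.protect ~finally:(fun () -> close_out oc)
    (fun () -> Marshal.to_channel oc a [])

(* Read it back; the type annotation is trusted, not verified. *)
let load (path : string) : f32_array =
  let ic = open_in_bin path in
  Fun.protect ~finally:(fun () -> close_in ic)
    (fun () -> (Marshal.from_channel ic : f32_array))
```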
The easiest way with the standard library in my opinion is
let bytes_of_float32 (x : float) : bytes =
  let b = Bytes.create 4 in
  let i32 = Int32.bits_of_float x in
  Bytes.set_int32_ne b 0 i32; (* or _be for big endian or _le for little endian *)
  b
You can then do whatever you want with your bytes
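Extending the snippet above to a whole array and a file could look like the following sketch. The function names are hypothetical, and little-endian is assumed so that the file layout does not depend on the machine that wrote it:

```ocaml
(* Write every element of [a] as a little-endian 32-bit float. *)
let write_float32_file (path : string) (a : float array) : unit =
  let b = Bytes.create (4 * Array.length a) in
  Array.iteri
    (fun i x -> Bytes.set_int32_le b (4 * i) (Int32.bits_of_float x))
    a;
  let oc = open_out_bin path in
  Fun.protect ~finally:(fun () -> close_out oc)
    (fun () -> output_bytes oc b)

(* Read the whole file back; the element count comes from the file size. *)
let read_float32_file (path : string) : float array =
  let ic = open_in_bin path in
  Fun.protect ~finally:(fun () -> close_in ic) (fun () ->
    let len = in_channel_length ic in
    let b = Bytes.create len in
    really_input ic b 0 len;
    Array.init (len / 4)
      (fun i -> Int32.float_of_bits (Bytes.get_int32_le b (4 * i))))
```

On the Python side, the matching layout is the struct format '<f' (or a NumPy dtype of '<f4'), so a file produced there should be readable with read_float32_file.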
A long while back I added IO.write_float32 and the matching read function to extlib to do just that. A tremendous contribution: 2 SLOC.
Yes! It is still in batteries: BatIO.{read_float|write_float}.
I never use BatIO, but I should have had a look at their code.
I dug out their code from this file https://github.com/ocaml-batteries-team/batteries-included/blob/d471e24712dd1c0adb90db6894c1c721078b3934/src/batIO.ml:
let read_real_i32 ch =
  let big = Int32.shift_left (Int32.of_int (read_byte ch)) 24 in
  let ch3 = read_byte ch in
  let ch2 = read_byte ch in
  let ch1 = read_byte ch in
  let base = Int32.of_int (ch1 lor (ch2 lsl 8) lor (ch3 lsl 16)) in
  Int32.logor base big

let read_float ch =
  Int32.float_of_bits (read_real_i32 ch)

let write_real_i32 ch n =
  let base = Int32.to_int n in
  let big = Int32.to_int (Int32.shift_right_logical n 24) in
  write_byte ch big;
  write_byte ch (base lsr 16);
  write_byte ch (base lsr 8);
  write_byte ch base

let write_float ch f =
  write_real_i32 ch (Int32.bits_of_float f)
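Note that this code emits the most significant byte first, i.e. big-endian. As far as I can tell, the same byte layout can be produced with the standard library alone; a sketch with hypothetical helper names:

```ocaml
(* Big-endian 32-bit float encoding using only the stdlib. *)
let float32_be_bytes (f : float) : bytes =
  let b = Bytes.create 4 in
  Bytes.set_int32_be b 0 (Int32.bits_of_float f);
  b

let float_of_float32_be (b : bytes) : float =
  Int32.float_of_bits (Bytes.get_int32_be b 0)
```

If the data comes from Python or C on a typical x86 or ARM machine, it will most likely be little-endian instead, in which case the _le variants are the ones to use.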
Answer to self:
Apparently, Unix.map_file (retrieving a bigarray from a file) is several orders of magnitude faster
than unmarshaling a float array from a file.
The write operation (using a shared mmapped bigarray) is slightly faster than marshalling a float array to file.
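A rough sketch of what the memory-mapped version might look like. It needs the unix library to be linked, the function names are hypothetical, and the file ends up in the machine's native byte order, so it is only portable between machines of the same endianness:

```ocaml
type f32_array =
  (float, Bigarray.float32_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Write [a] through a shared mapping; the file is grown to fit.
   Assumes [a] is non-empty. *)
let write_mapped (path : string) (a : float array) : unit =
  let fd = Unix.openfile path [ Unix.O_RDWR; Unix.O_CREAT; Unix.O_TRUNC ] 0o644 in
  Fun.protect ~finally:(fun () -> Unix.close fd) (fun () ->
    let m =
      Bigarray.array1_of_genarray
        (Unix.map_file fd Bigarray.float32 Bigarray.c_layout true
           [| Array.length a |])
    in
    Array.iteri (fun i x -> Bigarray.Array1.set m i x) a)

(* Map an existing file; the dimension -1 is inferred from the file size. *)
let read_mapped (path : string) : f32_array =
  let fd = Unix.openfile path [ Unix.O_RDONLY ] 0 in
  Fun.protect ~finally:(fun () -> Unix.close fd) (fun () ->
    Bigarray.array1_of_genarray
      (Unix.map_file fd Bigarray.float32 Bigarray.c_layout false [| -1 |]))
```

Reading is fast mostly because nothing is copied or decoded up front: pages are brought in by the OS when the bigarray is actually accessed.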
Is there a way in 2024 to read and write floats that are 32-bit or 16-bit? Maybe a library for floats like stdint? I am working on a project to read floats from a store, and such floats can be 16-bit, 32-bit or 64-bit. I also need to be able to write such floats back into the store. How can this be achieved?
I suspect you can devise something with bigarrays (which support 16 and 32 bits floats) and the brand new (5.2) In_channel.input_bigarray and Out_channel.output_bigarray functions. Or Unix.write_bigarray and Unix.read_bigarray if you are working with fds.
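If the store only hands back raw bytes, the 32-bit and 64-bit cases are covered by Int32.float_of_bits and Int64.float_of_bits, and the 16-bit case can be decoded by hand on any OCaml version. A sketch, assuming IEEE 754 binary16 stored little-endian; the function names are hypothetical, and encoding back to 16 bits (which needs rounding) is not covered here:

```ocaml
(* Decode an IEEE 754 binary16 bit pattern (given as an int in 0..0xFFFF)
   into an OCaml float. Layout: 1 sign bit, 5 exponent bits, 10 mantissa bits. *)
let float_of_half_bits (h : int) : float =
  let sign = if h land 0x8000 <> 0 then -1.0 else 1.0 in
  let e = (h lsr 10) land 0x1f in
  let m = h land 0x3ff in
  if e = 0 then
    (* zero and subnormals: m * 2^-24 *)
    sign *. ldexp (float_of_int m) (-24)
  else if e = 0x1f then
    (* infinities and NaNs *)
    if m = 0 then sign *. infinity else nan
  else
    (* normal numbers: (1024 + m) * 2^(e - 25) *)
    sign *. ldexp (float_of_int (m lor 0x400)) (e - 25)

(* Read a little-endian half-precision value at [pos]. *)
let read_half_le (b : bytes) (pos : int) : float =
  float_of_half_bits (Bytes.get_uint16_le b pos)
```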
Mmm, I am not sure if the In/Out_channel functions could be helpful. To be clear, I am trying to implement the chunked multidimensional array specification described here. The element data needs to be read from bytes stored in a “store” (a store here could be any key-value data structure like a directory, hashtable or s3 bucket) and deserialized as the data type mentioned in its metadata file. Updating an element requires data to be written back to the store as bytes (by serializing the value using the correct data type).
My issue is that for floats, OCaml only supports 64-bit floats, so I would not be able to successfully read element data from a store using the specified data type if it is not a 64-bit float. I am looking to see if there is a way I can work around this. I was hoping there is a library I could use to easily work with varying precision floating values.
That’s true. But using bigarrays you can have in memory arrays of {16,32,64}-bit floats and interact with them using these 64bit floats. Not sure if that’s enough for you. I guess it is if you don’t expect to crunch numbers in OCaml itself but, for example, simply ship these floats to a GPU.
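To make that concrete, here is a small sketch with a float32 bigarray (float16 works along the same lines on 5.2 and later, which this sketch deliberately avoids). Values go in and come out as ordinary 64-bit OCaml floats, but are rounded to single precision when stored. `round_to_float32` is a hypothetical helper name:

```ocaml
(* Store a float in a 32-bit slot and read it back. *)
let round_to_float32 (x : float) : float =
  let a = Bigarray.Array1.create Bigarray.float32 Bigarray.c_layout 1 in
  Bigarray.Array1.set a 0 x;   (* rounded to single precision here *)
  Bigarray.Array1.get a 0      (* widened back to a 64-bit float *)

let () =
  let a = Bigarray.Array1.create Bigarray.float32 Bigarray.c_layout 1000 in
  (* Each element takes 4 bytes, half of what a float array would use. *)
  Printf.printf "%d bytes\n" (Bigarray.Array1.size_in_bytes a);
  (* 1.5 survives exactly; 0.1 does not. *)
  Printf.printf "%b %b\n"
    (round_to_float32 1.5 = 1.5)
    (round_to_float32 0.1 = 0.1)
```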
Even when using the CPU, it is better to runtime-compile the specialized algorithm (via C or another portable assembly language); otherwise there is unnecessary overhead.