How to reliably build a Bigarray of a specific element type using a string

I have a problem where I need to create a Bigarray of a specific element type using information read from a JSON file. I must read a string from a JSON field which can be one of {"float32", "float64", "int8", "int16", "int32", "int64", "complex32", "complex64"}. Based on the string value I must construct an empty Bigarray whose elements have the datatype implied by that string. For example, “float32” means a Bigarray of kind Bigarray.Float32 must be created.

I tried writing a function to do the conversion but it won’t pass the type checker:

module B = Bigarray

let of_datatype dims = function
  | "float64" -> B.Genarray.create B.float64 B.c_layout dims
  | "float32" -> B.Genarray.create B.float32 B.c_layout dims
  | "int8" -> B.Genarray.create B.int8_signed B.c_layout dims
  ...

Is there an alternative approach I can use to accomplish the same result?

It would be nice if you showed us the type signature you are trying to implement. But it’s likely that you will have to wrap your bigarrays in an existential to hide the polymorphic type parameters. Something like:

type bigarray = Bigarray : ('a, 'b, B.c_layout) B.Genarray.t -> bigarray

let of_datatypes dims = function
  | "float64" -> Bigarray (B.Genarray.create B.float64 B.c_layout dims)
  …

Before reaching for GADTs, it might be better to check whether the of_datatypes function is needed at all. In particular, if, once parsed, the various cases are dispatched to different code paths, a simpler solution might be to dispatch the different cases directly with:

match input_kind with
| "float64" -> float_case (B.Genarray.create B.float64 B.c_layout dims)
| "float32" -> float32_case ...
  ...

(which is equivalent to having a record of continuations in of_datatypes, one for each case, which in turn is equivalent to hiding the bigarray types behind an existential quantification, but I think it is better to avoid climbing up the type-complexity ladder by accident).
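For reference, the record-of-continuations version could look roughly like this (a minimal sketch with hypothetical names, covering only two cases and reusing module B = Bigarray from above):

type 'r handlers = {
  on_float64 : (float, B.float64_elt, B.c_layout) B.Genarray.t -> 'r;
  on_float32 : (float, B.float32_elt, B.c_layout) B.Genarray.t -> 'r;
}

let dispatch_datatype dims handlers = function
  | "float64" -> handlers.on_float64 (B.Genarray.create B.float64 B.c_layout dims)
  | "float32" -> handlers.on_float32 (B.Genarray.create B.float32 B.c_layout dims)
  | s -> invalid_arg ("unknown datatype: " ^ s)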


It may help to simplify this problem by not thinking about Bigarrays for a moment. In essence, the problem is the same as this:

let of_datatype = function
  | "string" -> "a string"
  | "float" -> 3.14
  | "int" ->  100

which will not pass the typechecker for the same reason. In OCaml, an expression must have one single type. If you need to represent different possible types then you’ll have to wrap them in a variant type, apply different continuations to them, or something else like that (as illustrated above).
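For the simplified example, the variant-type option is a minimal sketch like this (constructor names are arbitrary):

type value =
  | String of string
  | Float of float
  | Int of int

let of_datatype = function
  | "string" -> String "a string"
  | "float" -> Float 3.14
  | "int" -> Int 100
  | s -> invalid_arg ("unknown datatype: " ^ s)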

I think the real question is, what do you need to do with your function? If you call it with let arr = of_datatype dims str in, then what do you expect arr to be? What type would arr.{0} return? Although there are techniques that can make of_datatype type-check, it’s possible that there’s a deeper issue with the rest of your code that makes it appear to need an ill-typed function.


I tried this but it doesn’t work when I try to recover the type of the bigarray. What I’m trying to do is write a library for large chunked and compressed N-dimensional arrays; basically implementing the spec outlined here: Zarr core specification (version 3.0) — Zarr specs documentation. Reading an array chunk from its underlying store involves using the array metadata to decompress the chunk’s bytes and decode them in a series of steps, update the values, and then compress it again and write it back to its underlying store.

What I’d like to achieve is to keep metadata about the Bigarray.kind, fill value and shape of the underlying chunk so that I can use that information to properly decode the array bytes. However, this approach hides the Bigarray kind, and when I try to recover it using pattern matching, the information is lost and I get errors about the type escaping its scope.

Is there a way I can reliably store the metadata of the Bigarray type so I can use it later when encoding/decoding the array bytes stored to disk (or wherever)?

The information is not lost, but you need to be careful in the way you pack it, and you can only unpack an existential and use it in a limited scope. Besides, you may have to write some of the function types that use the packed information explicitly. It’s difficult to help you without a minimal example that exhibits the problem you are facing.
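As a small sketch (reusing the bigarray existential and the B = Bigarray alias from earlier in the thread): you can unpack the existential and compute something whose type does not mention the hidden parameters, but you cannot return the unpacked array at its concrete type.

let total_elements (Bigarray a) =
  (* fine: the result type int does not mention the existential types *)
  Array.fold_left ( * ) 1 (B.Genarray.dims a)

(* let unpack (Bigarray a) = a   <- rejected: the type would escape its scope *)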

This is where I use the packed type to try and extract the kind, fill_value and shape of the array: zarr-ml/lib/store.ml at beb82892cd13ce79df2a155b556f4d412a9c0bf3 · zoj613/zarr-ml · GitHub

The implementation of Extension.DataKind is here: zarr-ml/lib/extension.ml at beb82892cd13ce79df2a155b556f4d412a9c0bf3 · zoj613/zarr-ml · GitHub

The attempt was to pass in this type: zarr-ml/lib/common.ml at beb82892cd13ce79df2a155b556f4d412a9c0bf3 · zoj613/zarr-ml · GitHub when decoding the array bytes so that the decoding logic can have access to the shape, Bigarray.kind and fill_value of the array. But all the techniques I tried led to type check errors, and packing the existential types seems to not help at all.

Do you have any suggestions on how I can simplify things? It seems like working with Bigarrays is a pain, but maybe it’s a skillset issue??!

BTW, the main working branch code is here: zarr-ml/lib at wip · zoj613/zarr-ml · GitHub

You are going against the grain of the library by trying to use the bigarray type as a single type rather than a family of types, which requires a lot of GADTs.

Since it doesn’t sound like you ever use the ability to distinguish between the array types, how about using a type that merges together all the bigarray types that you use:

open Bigarray
type barray =
  | Char of (char, int8_unsigned_elt, c_layout) Genarray.t
  | Int of (int, int_elt, c_layout) Genarray.t
  | Float of (float, float64_elt, c_layout) Genarray.t

Then you can create a metadata type that holds the information on how to fill the arrays:

module K = struct
  type fillable =
    | Char of char
    | Int of int
    | Float of float
end

and use it to create a barray of the correct kind without GADTs

let create_and_fill fkind shape =
  let fcreate k filler =
    let b = Genarray.create k c_layout shape in
    Genarray.fill b filler;
    b
  in
  match fkind with
  | K.Char c -> Char (fcreate Char c)
  | K.Int i -> Int (fcreate Int i)
  | K.Float f -> Float (fcreate Float64 f)

with no GADTs in sight.
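For instance (a hypothetical consumer), downstream code can then recover whatever it needs with one ordinary match:

let dims_of (b : barray) =
  match b with
  | Char a -> Genarray.dims a
  | Int a -> Genarray.dims a
  | Float a -> Genarray.dims a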

Using GADTs is a way to have more fine-grained control over the weaving of type information and data layout, but at the cost of an increase in complexity.

If you wish to go on the GADT path, it would be easier to help with specific examples.
Without such examples, a few rules of thumb when working with GADTs are:

  • write the type of your functions first
  • existential quantification allows you to make type information local. Local information cannot be made global, but it can be compared to global information.
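As a sketch of that last point (hypothetical names): an unpacked kind is only known locally, but it can still be compared against a globally known kind.

type any_kind = Kind : ('a, 'b) Bigarray.kind -> any_kind

let is_float64 (Kind k) =
  match k with
  | Bigarray.Float64 -> true  (* in this branch, 'a = float and 'b = float64_elt *)
  | _ -> false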

@octachron Thanks a ton for the detailed response. It certainly opened my eyes, and I now realize I could implement the logic a bit more simply. So I minted a new Datatype module to parse/serialize datatype information from the json metadata file: zarr-ml/lib/extension.ml at 52a6ae0f831750d5f9c99974fee6ac5772b44fb1 · zoj613/zarr-ml · GitHub.

This is then used inside the set_array function through pattern matching as shown here: zarr-ml/lib/store.ml at 52a6ae0f831750d5f9c99974fee6ac5772b44fb1 · zoj613/zarr-ml · GitHub

This seems to get rid of the scope errors I was getting…but now I have a slightly different problem with the chained Result monads. I get a type check error when trying to compile the code:

File "lib/store.ml", lines 98-105, characters 8-60:
 98 | ........let* b = get chunkkey t in
 99 |         let* arr = if String.(equal b empty) then
100 |           Ok (Ndarray.create kind shape fill_value)
101 |         else
102 |           Chain.decode chain repr b
103 |         in
104 |         List.iter (fun (coord, y) -> Ndarray.set arr coord y) vals;
105 |         Result.map (set t chunkkey) (Chain.encode chain arr).....
Error: This expression has type
         (unit,
          [> `Bytes_decode_error of key
           | `Bytes_encode_error of key
           | `Crc32c_decode_error of key
           | `Crc32c_encode_error of key
           | `Gzip of Ezgzip.error
           | `Invalid_byte_range of key
           | `Key_not_found of key
           | `Transpose_decode_error of key * int * int
           | `Transpose_encode_error of key * int * int ])
         result
       but an expression was expected of type unit

My editor’s LSP server points to lines 96, 98 and 105 of this snippet: zarr-ml/lib/store.ml at 52a6ae0f831750d5f9c99974fee6ac5772b44fb1 · zoj613/zarr-ml · GitHub

My guess is that this is because the iter function returns unit and not a result value. I am not sure how to compose this imperative piece of code with the rest of the monad chain in the set_array function. Is there a pattern I need to adopt that I’m not aware of yet?

If the iter function cannot fail, you can just fix the type of perform. If a step of an iteration can fail, then iter is not the right function. You could use a fold to implement a variant such as:

let ( let* ) = Result.bind (* needed for the let* syntax below *)
let iter_until f a =
  Array.fold_left (fun acc x -> let* () = acc in f x) (Ok ()) a
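For example (with a hypothetical check function), the fold short-circuits at the first Error:

let check x = if x >= 0 then Ok () else Error (`Negative x)
let r = iter_until check [| 1; 2; -3; 4 |]  (* r = Error (`Negative (-3)) *)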

That was very helpful and I restructured the code to: zarr-ml/lib/store.ml at adcfacecd92e55c940276fde5fdd4e168980f3ad · zoj613/zarr-ml · GitHub

This got rid of the error…but now I have a new one regarding scope:

File "lib/store.ml", line 124, characters 62-65:
124 |         Result.map (set t chunkkey) (Chain.encode chain arr)) tbl (Ok ()) in
                                                                    ^^^
Error: This expression has type (int array * a/2) list Arraytbl.t
       but an expression was expected of type (int array * a) list Arraytbl.t
       The type constructor a would escape its scope
       File "lib/store.ml", lines 112-124, characters 8-73:
         Definition of type a
       File "lib/store.ml", lines 92-132, characters 6-58:
         Definition of type a/2

I tried passing tbl in as an argument and also explicitly annotating its type, but this made the expression Ndarray.to_array x on line 111 produce the same scoping error. I am not sure how to proceed.

The error message is telling you that the perform function is not polymorphic, since it is using a table whose type involves the type a coming from the binding in the set_array function. In particular, there is no reason to assume that the type a from the kind argument of perform has anything to do with the type a of the Ndarray.

I would recommend avoiding nested functions with universal quantification, and if you do nest them, it would be clearer to use different type names:

let set_array: type a b. 
  Path.t -> Owl_types.slice -> (a, b) Ndarray.t -> t
  -> (unit, [> set_error]) result
= ...
let perform : type elt tag.
  (elt, tag) Bigarray.kind -> elt
  -> (unit, [> set_error]) result
= ...

However, in this case the problem starts with the type:

let set_array: type a b. 
  Path.t -> Owl_types.slice -> (a, b) Ndarray.t -> t
  -> (unit, [> set_error]) result

With this type, you are promising that the function set_array works for any types a and b with no information about those types. This can only work if set_array avoids any kind of operation that requires knowing those types, which is not the case here.

To avoid this conundrum, you need to add some runtime information about the Ndarray that you are manipulating. For instance, you could add a kind argument to convey this information.
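For example, a kind argument is enough to produce a matching fill value at runtime. A minimal sketch (not the library’s API; only a few kinds shown):

let fill_value_of_kind : type a b. (a, b) Bigarray.kind -> a = function
  | Bigarray.Float32 -> 0.
  | Bigarray.Float64 -> 0.
  | Bigarray.Int8_signed -> 0
  | Bigarray.Char -> '\000'
  | _ -> failwith "unsupported kind"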

But once again, it is not clear what you gain in your use case by piling up GADTs, compared to replacing the ndarray type by the simpler barray variant that I showed you previously.


The set_array function is user-facing, and the caller passes in a bigarray whose values get written to the chunked array in the underlying storage. The Owl_slicing.slice type gives information about where in the chunked array to write the values. Wouldn’t hiding the array information by requiring the caller to wrap the Ndarray in a darray type make the API a bit hard to use?

In any case, I updated the implementation to make use of the information provided by the input array x and got rid of the scoping errors. Basically, I extracted the kind and fill_value from it. The fill_value isn’t exactly used, so it’s just a “stub” used to instantiate an array_repr record. Here is the updated implementation: zarr-ml/lib/store.ml at 72d22d0bd37d2014d48df54f59b4d24f4a2217fc · zoj613/zarr-ml · GitHub

Ah, I did forget that Bigarray provides a kind function. However, beware that your code is wrong for zero-sized arrays, and you should not use to_array, which performs a full copy of the array, just to extract the first element.

Also, since Bigarray provides a kind function, you can write a ('a,'b) ndarray -> any_ndarray function and only use the any_ndarray type internally.
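A rough sketch of that idea (any_ndarray is a hypothetical name, and Genarray stands in for the library’s own Ndarray type):

type any_ndarray =
  | Any : ('a, 'b) Bigarray.kind * ('a, 'b, Bigarray.c_layout) Bigarray.Genarray.t -> any_ndarray

let pack arr = Any (Bigarray.Genarray.kind arr, arr)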

I would advise not prioritizing the optimization of your API for users that you don’t yet have before you have a working library. It will be easier to tune the exposed API once your library is working and you can identify potential pain points.


To me, this looks like you should simply provide a function that translates the string to a type.
So “float64” → Float64, “int8” → Int8, etc.
Then the user just has to call exactly the function that returns a bigarray of exactly the wanted kind. Say bigarray_f64, bigarray_i8, etc.
The user of your lib knows what he wants. Just provide him the services to get exactly that. If the user is undecided about what he wants, it is his turn to make up his mind.
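A sketch of those per-kind entry points (the names bigarray_f64 and bigarray_i8 are just placeholders):

let bigarray_f64 dims = Bigarray.Genarray.create Bigarray.float64 Bigarray.c_layout dims
let bigarray_i8 dims = Bigarray.Genarray.create Bigarray.int8_signed Bigarray.c_layout dims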

But maybe this is too pragmatic …

Thank you for this piece of advice, it helped me make good progress since I last posted here. I now have the set_array and get_array functions working well. I just need to start thinking about how to include concurrency for greater performance, since array “chunks” are independent of each other and can be processed in parallel.

Thanks for pointing this out. It actually led me to a better solution by minting a fillvalue_of_kind function, since I now require Bigarray.kind as a parameter to the get_array function. It made things a lot easier to implement. See: zarr-ml/lib/common.ml at 8cfe377f5a4bf3058ab864f280a505664fb433cd · zoj613/zarr-ml · GitHub and how it is used at the call site: zarr-ml/lib/store.ml at 8cfe377f5a4bf3058ab864f280a505664fb433cd · zoj613/zarr-ml · GitHub

I’m not sure if I follow you here, could you please elaborate?

I figured that I probably don’t need the function in the OP; at the call site I now require the user to pass in a Bigarray.kind value as a parameter to simplify things (which automatically takes care of all data types). See: zarr-ml/lib/store.ml at 8cfe377f5a4bf3058ab864f280a505664fb433cd · zoj613/zarr-ml · GitHub