It mostly depends on your allocation policy and the lifetime of your buffers. The question is difficult and the answer depends on your context, but I can list some particularities of both:
- `bigstring`/`bigarray` cannot be relocated by the GC. That mostly means that the buffer will never move, even when the GC runs a cycle.
- Because the buffer is never relocated, we can release the GC lock. This is what happens in `digestif`, which is an implementation of several hash algorithms. We know that these algorithms mostly “calculate”; they don’t allocate, for instance. So we are able to say that the upcoming computation can be done regardless of the GC and, in the context of `lwt`/`async` (or even `multicore`), this allows a kind of true parallelism.
- `Bigarray.sub` allocates a “proxy” of the initial `bigarray`. A `sub` does not copy the `bigarray`; it gives you a smaller representation which provides access to a slice of the `bigarray`. An example is `mirage-tcpip`, which introspects the TCP/IP packet by a succession of `sub`, which permits zero-copy between the given packet and the application layer.
  - For this specific aspect, the reality is a bit more complex. Indeed, even if we only want to allocate a smaller representation of the given `bigarray` (a slice), this representation will be allocated in the major heap (but I think it’s not true anymore due to this commit). This is why `cstruct` appeared as a solution to keep the ability to get some slices from a `bigarray` and allocate them in the minor heap (which is faster than the major heap). From that, a nice API now exists to manipulate `bigarray` and take this particular advantage.
- Specialization on `int32 Bigarray` and `int64 Bigarray` is done by the compiler. That mostly means that if you manipulate such values, the compiler is able to avoid an extra allocation on the projection/injection of these values from/to the `bigarray`. Some calculations can become pretty fast compared to an `int8 Bigarray` with `{get,set}_int{32,64}` functions used to manipulate these values serialized in a certain form (endianness).
- Small `bytes` (less than `Max_young_wosize = 256` words) are allocated on the minor heap, which consists of “just” preparing a new block and shifting the pointer of the stop-and-copy minor heap (which is pretty fast).
- You can take advantage of `Bytes.unsafe_{of,to}_string` to manipulate `string` (and avoid an illegal `set` via the type system) for free since, at runtime, `string` and `bytes` have the same representation.
- If you want to `mmap`, you must use a `bigarray`.
- If you want to manipulate a shared buffer between multiple processes, you must use a `bigarray`, again because the GC will never move the buffer. This is what I try to do on my side with `rowex`, a small persistent index.
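To make two of these points concrete, here is a minimal sketch using only the standard library (assuming OCaml 4.07 or later, where `Bigarray` ships with the stdlib). The `bigstring` alias and the helpers `slice` and `hex_of_char` are names I made up for the illustration; they come from neither `cstruct` nor `mirage-tcpip`. The first helper shows that a `sub` shares storage with its parent, the second shows the usual "fill a fresh `bytes`, then freeze it" pattern.

```ocaml
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Hypothetical helper: view [len] bytes of [buf] starting at [off].
   [Bigarray.Array1.sub] allocates a proxy and shares the storage,
   so no byte is copied. *)
let slice (buf : bigstring) ~off ~len : bigstring =
  Bigarray.Array1.sub buf off len

(* Hypothetical helper: render one byte as two hex digits.
   The fresh [bytes] is filled, then turned into a [string] without a
   copy. This is only sound because [b] is never mutated or shared
   after [unsafe_to_string]. *)
let hex_of_char (c : char) : string =
  let digits = "0123456789abcdef" in
  let b = Bytes.create 2 in
  Bytes.set b 0 digits.[Char.code c lsr 4];
  Bytes.set b 1 digits.[Char.code c land 0x0f];
  Bytes.unsafe_to_string b

let () =
  let packet : bigstring =
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout 16
  in
  Bigarray.Array1.fill packet '\000';
  let payload = slice packet ~off:4 ~len:8 in
  (* A write through the slice is visible in the original buffer. *)
  Bigarray.Array1.set payload 0 'x';
  assert (Bigarray.Array1.get packet 4 = 'x');
  print_endline (hex_of_char 'x')
```

The slice is cheap because only the proxy is allocated; the trade-off is that the parent buffer stays alive as long as any slice of it does.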
I think some other particularities exist but again, it really depends on what you want to do. For instance:
- `decompress` (an implementation of `zlib`) uses `bigarray` because it’s fair to assume that the input buffer and the output buffer will have a looong life.
- On the opposite, `digestif` uses both types: it can be interesting to take advantage of the GC lock (and the ability to release it), and it is still interesting to digest a simple `string` or small objects (in general).
- I just started a draft to use `bytes` instead of `cstruct`/`bigarray` in `mirage-crypto` after I checked its memory consumption, which can put huge pressure on the GC due to the allocation via `malloc` of small objects (2 or 4 bytes).
- Obviously, a library such as `parmap` must use `bigarray` as a shared buffer between processes to do true parallelism.
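If you want to observe the "small `bytes` live on the minor heap" behaviour yourself, a rough way is to watch `Gc.minor_words` around a burst of tiny allocations. This is only a sketch: `minor_words_of` and `allocate_small_bytes` are hypothetical helpers, and the exact count depends on the word size and the runtime version, so I do not claim a precise number.

```ocaml
(* Hypothetical helper: minor-heap words allocated while running [f ()]. *)
let minor_words_of (f : unit -> unit) : float =
  let before = Gc.minor_words () in
  f ();
  Gc.minor_words () -. before

(* Allocate [n] tiny buffers; [Sys.opaque_identity] keeps the compiler
   from discarding the otherwise unused allocation. *)
let allocate_small_bytes (n : int) () =
  for _i = 1 to n do
    ignore (Sys.opaque_identity (Bytes.create 4))
  done

let () =
  let words = minor_words_of (allocate_small_bytes 1_000) in
  Printf.printf "minor words for 1000 x Bytes.create 4: %.0f\n" words
```

Each 4-byte `bytes` costs a header plus at least one word, so the counter should grow by at least a couple of thousand words here, all of it cheap bump allocation rather than a `malloc` per object.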
Some questions arise from all of that:
- can we functorize the code over a common interface between `bytes` and `bigarray`?
- can we use GADTs to specialize some branches according to these values?
- should we just be arbitrary in our choice?
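To make the first question concrete, here is roughly what functorizing over a common interface could look like. Everything in it is an assumption for the sake of illustration: the `BUFFER` signature, the `Make_checksum` functor and the toy additive checksum are hypothetical and do not come from `digestif` or any other library mentioned here.

```ocaml
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Hypothetical minimal interface shared by both buffer types. *)
module type BUFFER = sig
  type t
  val length : t -> int
  val get : t -> int -> char
end

module Bytes_buffer : BUFFER with type t = bytes = struct
  type t = bytes
  let length = Bytes.length
  let get = Bytes.get
end

module Bigstring_buffer : BUFFER with type t = bigstring = struct
  type t = bigstring
  let length (b : t) = Bigarray.Array1.dim b
  (* The annotation fixes the kind and layout at this call site. *)
  let get (b : t) i = Bigarray.Array1.get b i
end

(* Toy algorithm written once against the interface:
   sum of all bytes modulo 65536. *)
module Make_checksum (B : BUFFER) = struct
  let checksum (buf : B.t) : int =
    let acc = ref 0 in
    for i = 0 to B.length buf - 1 do
      acc := (!acc + Char.code (B.get buf i)) land 0xffff
    done;
    !acc
end

module Checksum_bytes = Make_checksum (Bytes_buffer)
module Checksum_bigstring = Make_checksum (Bigstring_buffer)

(* Hypothetical helper used only to build test inputs (it copies). *)
let bigstring_of_string (s : string) : bigstring =
  let b =
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout (String.length s)
  in
  String.iteri (fun i c -> Bigarray.Array1.set b i c) s;
  b
```

Both instances compute the same result, e.g. `Checksum_bytes.checksum (Bytes.of_string "ab")` and `Checksum_bigstring.checksum (bigstring_of_string "ab")`. The open question is the cost: inside the functor body, `B.get` is an ordinary function of the argument module, so whether it gets inlined down to a direct memory access depends on the optimizer.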
I would like to say that, from my experiments, OCaml is not really able to specialize an implementation which uses a `'buffer` via functors or GADTs. I know that you should have a better chance with `flambda`, which is more aggressive than vanilla OCaml. But from my experience, arbitrarily choosing `bytes` or `bigarray` is not an obstacle to adoption as long as the choice is consistent with your usage, and this is where it becomes complex to fully describe what you need.
But in any case, it’s hard to have the best of both worlds in the same type. Many of these particularities are mutually exclusive due to the underlying design of the OCaml runtime. So I continue to say that it depends.