`bytes` vs `char array`

To me it seems that bytes is basically a char array, so why bother having such a primitive when we could, in theory, implement a Byte module with an abstract type t = char array? Were there historical reasons, or am I missing something else?

Thanks!

You are missing the memory representation!

'a array is a polymorphic type. This means that each array slot has a pointer to the actual 'a value.

So a char array is an array filled with pointers to individual characters. That’s neither efficient, nor what most system IO APIs expect.

A bytes value is a contiguous sequence of bytes in memory.

In ASCII-art terms, for the string “abc” (not exactly; see my correction below):

+---+---+---+
| . | . | . |
+-|-+-|-+-|-+
  v   v   v
  a   b   c

vs

+---+---+---+
| a | b | c |
+---+---+---+

Thanks. So it’s purely for performance reasons?

Also, why couldn’t an array store a bunch of values directly? Based on my understanding of the runtime, if the value is an integer or char, we just shift it to get the actual value; if it’s a pointer, we dereference it to get the actual value, and so on. But my understanding of the internals is very rudimentary, so please correct me if I’m wrong!


And interoperability reasons.

Note that what I wrote above is actually slightly wrong: char values are effectively represented by integers, and integers are unboxed in OCaml, so what you have in the case of a char array is:

 +------+------+------+
 | ...a | ...b | ...c |
 +------+------+------+

But the size of the cells of the array is the word size of your machine (i.e. enough to be able to hold a pointer), so you still don’t have the packed representation expected by a C array of bytes.
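
The difference in footprint can be observed from OCaml itself. The following is only a rough sketch: it relies on the Obj module, which exposes runtime internals and is not meant for ordinary code, and the names words_of, n, array_words and bytes_words are made up for this illustration. Obj.size reports the number of words in a heap block, excluding the header.

```ocaml
(* Size in machine words of a heap block, excluding its header. *)
let words_of (v : 'a) : int = Obj.size (Obj.repr v)

let n = 16
let word_bytes = Sys.word_size / 8

(* A char array uses one full word per element. *)
let array_words = words_of (Array.make n 'a')

(* A bytes value packs one character per byte, plus trailing padding. *)
let bytes_words = words_of (Bytes.make n 'a')

let () =
  Printf.printf "char array: %d words (%d bytes)\n"
    array_words (array_words * word_bytes);
  Printf.printf "bytes:      %d words (%d bytes)\n"
    bytes_words (bytes_words * word_bytes)
```

The array should come out at n words, while the bytes value needs roughly n divided by the word size in bytes, plus one word that holds the padding.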


Ah, that is a good point.

Would you mind elaborating a little on the “interoperability” part? Or if you could point me to sources, I’m down to dig it up on my own.

Thanks again!

Note that with a char array you end up wasting 7 bytes per byte on a 64-bit machine, so that quickly becomes costly.

Regarding interoperability. Suppose you want to call the C write(2) function.

The function takes a buffer b to read from and a number n of bytes to write. It reads n contiguous bytes from b, so if you give it a char array it will also read the wasted bytes mentioned above, which is not what you want.
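
On the OCaml side this shows up in the types: Unix.write takes a bytes buffer (its type is file_descr -> bytes -> int -> int -> int), and so does Stdlib's output. As a minimal sketch, assuming you are stuck with a char array, you would first have to pack it into bytes. The helper names bytes_of_char_array and round_trip are hypothetical, and the example uses Stdlib channels rather than Unix so that it needs no extra library.

```ocaml
(* Pack a char array into the contiguous buffer that I/O functions expect. *)
let bytes_of_char_array (a : char array) : bytes =
  Bytes.init (Array.length a) (fun i -> a.(i))

(* Write the packed buffer to a temporary file and read it back. *)
let round_trip (a : char array) : string =
  let path = Filename.temp_file "bytes_demo" ".bin" in
  let buf = bytes_of_char_array a in
  let oc = open_out_bin path in
  (* output takes the bytes buffer, an offset and a length. *)
  output oc buf 0 (Bytes.length buf);
  close_out oc;
  let ic = open_in_bin path in
  let s = really_input_string ic (in_channel_length ic) in
  close_in ic;
  Sys.remove path;
  s
```

The extra copy in bytes_of_char_array is exactly the cost you avoid by keeping the data in bytes from the start.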


I see now. Thank you so much for answering all my questions!


It comes down to some design decisions made by the designers of the OCaml language:

OCaml does not have very strong support for overloading. The bytecode compiler does not attempt to generate specialized code when a function is polymorphic (although the native compiler could do so in some cases as an optimization), nor is Stdlib.Array written in such a way that a more efficient memory representation is chosen at runtime.

The former is analogous to Haskell’s type classes or C++’s templates, and the latter is what dynamically typed or OOP-based languages such as Common Lisp can do. They can have a “packed char array” or even a bit array rather easily while maintaining a unified interface.

Actually, it is, since the standard library already supports packed float arrays. (I am not discussing whether this was a good or bad decision, just that it has been implemented for a long time.)

There is a bit of a technical difficulty when it comes to packed char arrays though. Indeed, packed float arrays rely on the fact that float values are boxed and thus their dynamic type is known at runtime. Since char values are not boxed, the runtime would have no way to know whether Array.make is supposed to create a packed char array or a generic value array. Other Array functions (which receive already created arrays) do not have this issue and would work fine with packed char arrays, were they implemented.
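
The distinction can be poked at with the Obj module. This is only an illustrative sketch of the point above, not how the runtime itself is written; the binding names are invented for the example, and the last check assumes a compiler built with the default flat float array setting.

```ocaml
(* Chars are immediate values: at runtime they look just like ints. *)
let char_is_immediate = Obj.is_int (Obj.repr 'a')

(* Floats are boxed, so the runtime can recognise them by their block tag. *)
let boxed_float = Obj.repr (Sys.opaque_identity 1.0)
let float_is_boxed = Obj.is_block boxed_float
let float_has_double_tag = Obj.tag boxed_float = Obj.double_tag

(* A char array is an ordinary block with tag 0, like any generic array. *)
let char_array_tag = Obj.tag (Obj.repr [| 'a'; 'b' |])

(* Assumption: default configuration, where float arrays get a special tag. *)
let float_array_is_packed =
  Obj.tag (Obj.repr [| 1.0; 2.0 |]) = Obj.double_array_tag
```

Since a char carries no tag of its own, array creation would have nothing to dispatch on, which is the difficulty described above.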