Hey!
I recently opened a PR to add dynamic arrays to the standard library.
However, I’ve been told that it would be better to discuss it first (apart from the various problems in my implementation).
So I wonder what other people think about it and what the best way to implement it would be.
As far as I know, we have at least two “battle tested” implementations of vectors around: the one from Batteries and the one from Containers (I could not find one in base or core_kernel). There is also the code from Jean-Christophe Filliâtre: vector.mli and vector.ml.
I think it would be good to start by reviewing the interfaces of all these vector APIs and implementations, what they offer and how much they differ. Then we could discuss which code we want to import in the compiler.
As a user, I consider it part of a dynamic array contract that no time is ever wasted initializing unused array slots (besides what’s mandated by the OS/libc). Therefore dynamic arrays are impossible to implement in pure safe OCaml.
One would have to resort to C or to abusing the Obj module, which is more portable I guess.
I would therefore strongly recommend taking inspiration from Batteries’ BatDynArray module.
Personally I’m not completely convinced by the suggestion to use unsafe features for performance. I think that there would also be value in a fairly straightforward implementation, that is not going to require too much maintenance as the compiler and optimizers evolve, or when compiled with an exotic backend, etc. I always felt that the Batteries implementation, which was inherited wholesale from Extlib, was too complex.
An implementation that doesn’t require boxing for undefined values is one that retains an initialization-time element (assuming all dynarray-creation functions require an element value) and uses it to fill the unused slots. This has the downside that the lifetime of this element is extended to the lifetime of the dynamic array (it will not be collected even if all its occurrences in the storage array are removed), but this can be made reasonably clear/expected with good labelled-argument names in the API.
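To make the idea concrete, here is a minimal sketch of such a dummy-backed dynamic array. All names (`create`, `~dummy`, `push`, `pop`, `get`) are hypothetical and chosen for illustration; this is not the code of any of the libraries mentioned in this thread.

```ocaml
(* Hypothetical sketch: the user-supplied [dummy] fills every unused slot,
   so no boxing and no Obj tricks are needed. *)
type 'a t = {
  dummy : 'a;               (* stays reachable as long as the vector does *)
  mutable data : 'a array;  (* storage; slots at index >= size hold [dummy] *)
  mutable size : int;       (* number of live elements *)
}

let create ~dummy = { dummy; data = [||]; size = 0 }

let length v = v.size

let push v x =
  if v.size = Array.length v.data then begin
    (* Storage is full: allocate a larger array pre-filled with [dummy]. *)
    let cap = max 8 (2 * Array.length v.data) in
    let data = Array.make cap v.dummy in
    Array.blit v.data 0 data 0 v.size;
    v.data <- data
  end;
  v.data.(v.size) <- x;
  v.size <- v.size + 1

let pop v =
  if v.size = 0 then invalid_arg "pop: empty vector";
  v.size <- v.size - 1;
  let x = v.data.(v.size) in
  (* Overwrite the freed slot so the vector no longer retains [x]. *)
  v.data.(v.size) <- v.dummy;
  x

let get v i =
  if i < 0 || i >= v.size then invalid_arg "get: index out of bounds";
  v.data.(i)
```

The labelled `~dummy` argument is one way to signal to the caller that this value is retained for the whole lifetime of the vector.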
The other thing to watch for is bad worst-case behavior due to “stuttering” if the resizing logic is not correct: if you double the size when the array gets full (we need at least one more element) and halve the size when it gets small (half the elements are unused), you run the risk that at a certain point you double and halve repeatedly by adding or removing just one element. There are various ways to get around it (typically “halving” only when three-quarters of the elements are unused), but it’s important to avoid this complexity bug. (The PR you submitted doesn’t have any shrinking logic, which might be reasonable but is a choice that has to be discussed in detail.)
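The policy described above can be sketched as two pure functions on the capacity. The names and the minimum capacity of 8 are assumptions made for the example, not something taken from the PR or from any existing library.

```ocaml
(* Hypothetical resizing policy: double when full, halve only when at most
   a quarter of the slots are in use. *)
let min_capacity = 8

(* [length] is the number of elements before the push. *)
let capacity_after_push ~capacity ~length =
  if length = capacity then max min_capacity (2 * capacity) else capacity

(* [length] is the number of elements after the pop. *)
let capacity_after_pop ~capacity ~length =
  if capacity > min_capacity && 4 * length <= capacity
  then max min_capacity (capacity / 2)
  else capacity
```

With these thresholds, a full array of capacity 16 grows to 32 on the next push, but popping that element back does not shrink it: the array only halves once its length drops to 8, and at that point it is half full, so the next push does not trigger a new doubling.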
The implementation of Jean-Christophe Filliâtre pointed to by Armaël has these two good properties, and is generally a pretty nice implementation. I think it would be a good starting point for a stdlib proposal; it would need to be extended with at least conversion functions to and from the Seq module for consistency.
Understandable opinion, yet in this particular instance I do not find that this implementation has been such a burden: since the implementation was inherited in 2010, there have been only 5 fixes and no refactoring at all.
From what I can see in this log, the API of the Obj module has been surprisingly constant and reliable,
quite counterintuitively.
Regarding shrinking, my feeling (but that’s maybe just a habit) is that the user must have a say about when to (or not to) resize; in particular, when to resize down to the exact length, as more often than not she knows when the array has reached its “cruising” size.
That’s also the C++ API for vectors. Of course the prevalence of physical equality makes this level of control even more important in C++ than in OCaml, but OCaml has physical equality too and some may want to make use of it, which is close to impossible if there is no control at all over when the array is going to be resized.
Even if there is some “smart” shrinking happening under the hood, it’s important that the user can force a resize in any case (when you know your huge array has stopped growing, you really want the potentially large amount of extra room to go away).
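As an illustration of what such user-controlled shrinking could look like, here is a minimal sketch in the spirit of C++’s `shrink_to_fit`. The record type `vec` and the function names are hypothetical and only serve the example.

```ocaml
(* Hypothetical minimal vector: [data] may be longer than [size]. *)
type 'a vec = { mutable data : 'a array; mutable size : int }

let capacity v = Array.length v.data

(* Reallocate the storage so that the capacity equals the current length.
   Nothing happens if the storage is already exactly the right size. *)
let shrink_to_fit v =
  if capacity v > v.size then v.data <- Array.sub v.data 0 v.size
```

Because the reallocation only happens when the caller asks for it, the identity of the underlying storage stays stable between explicit calls.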
Again, if not for performance there is close to no use for resizeable arrays. Any elegant map would do.
Scratch that, I was thinking about unboxed objects (too much C++ these days), which should not be a concern indeed (but maybe for resizeable float arrays).
We also have STL-like vectors in BAP. Here is the documentation and the implementation. The implementation is pretty straightforward, in the vein of the Buffer.t module, with the only difference being that we need to have a default value.
Did this go anywhere?
I really think not having this in OCaml is embarrassing; dynamic arrays are a feature that is available in virtually every language.
Besides, I think it should live in the stdlib, because I believe it’s impossible to write a good vector implementation without a bit of unsafe Obj to manage the uninitialized part of the array. (I don’t like the requirement to provide a dummy value; it’s a strictly inferior API that is found in no other language that I know of.)
Arrays in OCaml are already magical; why not dynamic arrays? Even in Rust they’re somewhat magical, as they concentrate a lot of unsafe blocks. People modifying the compiler and breaking the (tiny) amount of required Obj can update the implementation as they go.
FWIW, I agree that an API that doesn’t require providing a dummy value would be very valuable. I wouldn’t mind if initially it was backed by an 'a option array, and maybe later optimized using a solution that’s more costly to maintain.
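For reference, a dummy-free API backed by an `'a option array` could look roughly like the sketch below. The function names are hypothetical; the point is only that the caller never has to supply a placeholder value, at the cost of boxing each element in `Some`.

```ocaml
(* Hypothetical sketch: unused slots hold [None], live slots hold [Some x]. *)
type 'a t = { mutable data : 'a option array; mutable size : int }

let create () = { data = [||]; size = 0 }

let length v = v.size

let push v x =
  if v.size = Array.length v.data then begin
    (* Storage is full: allocate a larger array filled with [None]. *)
    let cap = max 8 (2 * Array.length v.data) in
    let data = Array.make cap None in
    Array.blit v.data 0 data 0 v.size;
    v.data <- data
  end;
  v.data.(v.size) <- Some x;
  v.size <- v.size + 1

let get v i =
  if i < 0 || i >= v.size then invalid_arg "get: index out of bounds";
  match v.data.(i) with
  | Some x -> x
  | None -> assert false (* invariant: every slot below [size] is filled *)

(* Returns [None] on an empty vector, otherwise the removed last element. *)
let pop v =
  if v.size = 0 then None
  else begin
    v.size <- v.size - 1;
    let x = v.data.(v.size) in
    v.data.(v.size) <- None; (* drop the reference so it can be collected *)
    x
  end
```

The interface exposes no trace of the option boxing, so the representation could later be swapped for a more optimized one without changing user code.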