Yeah, I saw that, but I wondered if it wasn’t better to have it implemented in base OCaml as I think it is a very useful structure.
I agree that it should be in the standard library. I just think you should pull the code from Containers, with minor modifications as needed.
Do I have the right to do that?
Thanks for reropening the discussion here! I really think this is a better place to discuss design at this stage.
As far as I know, we have at least two “battle tested” implementations of vectors around: the one from Batteries and the one from Containers (I could not find one in base or core_kernel). There is also the code from Jean-Christophe Filliâtre: vector.mli vector.ml
I think it would be good to start by reviewing the interfaces of all these vector APIs and implementations, what they offer and how much they differ. Then we could discuss which code we want to import in the compiler.
Base’s Queue resizes an array under the hood, but it’s not extracted out and has some queue-specific functionality.
I wouldn’t mind, but the API could be reworked and simplified (the phantom type for mutability isn’t really worth it in the end).
As a user, I consider it part of a dynamic array contract that no time is ever wasted initializing unused array slots (beside what’s mandated by the OS/libc). Therefore dynamic arrays are impossible to implement in pure safe OCaml.
One would have to resort to C or abusing the Obj module, which is more portable I guess.
I would therefore strongly recommend to take inspiration from batteries batDynArray module.
Here’s another implementation for comparison https://github.com/mmottl/res
Personally I’m not completely convinced by the suggestion to use unsafe features for performance. I think that there would also be value in a fairly straightforward implementation, that is not going to require too much maintenance as the compiler and optimizers evolve, or when compiled with an exotic backend, etc. I always felt that the Batteries implementation, which was inherited wholesale from Extlib, was too complex.
An implementation that doesn’t require boxing for non-defined values is one that retains the initialization-time elements (assuming all dynarray-creation functions require an element value) and uses it to fill missing elements. This has the downside that the lifetime of this element is extended to the lifetime of the value (it will not be collected even if all its occurrences in the storage array are removed), but this can be made reasonably clear/expected with good labelled-argument names in the API.
The other thing to watch for is bad worst-case behaviors due to “stuttering” if the resizing logic is not correct: if you double the size when the array gets large (we need at least one more element) and halve the size when it gets small (half the elements are unused), you run a risk that at a certain point you can double and halve repeatedly by adding or removing just one element. There are various ways to go around it (typically “halving” only when three-quarters of the elements are unused), but it’s important to avoid this complexity bug. (The PR you submitted doesn’t have any shrinking logic, which might be reasonable but is a choice that has to be discussed in details.)
The implementation of Jean-Christophe Filliatre pointed by Armaël has these two good properties, and generally a pretty nice implementation. I think it would be a good starting point for a stdlib proposal – it would need to be extended with at least conversion functions to and from the
Seq module for consistency.
Understandable opinion, yet in this particular instance I do not find that this implementation have been such a burden: since the inherited implementation in 2010, there has been only 5 fixes and no refactoring at all.
From what I can see in this log, the API of the Obj module have been surprisingly constant and reliable,
quite counter intuitively.
Regarding shrinking, my feeling (but that’s maybe just a habit) is that the user must have a say about when to (or not to) resize; in particular, when to resize down to the exact length, as most often than not she knows when the array reached its “cruising” size.
That’s also the C++ API for vectors. Of course the prevalence of physical equality makes this level of control even more important in C++ than in OCaml, but OCaml have physical equality too and some may want to make use of it, which is close to impossible if there is no control at all on when the array is going to be resized.
Even if there is some “smart” shrinking happening under the hood, it’s important that the user can force a resize in any cases (when you know your huge array has stopped growing you really want the extra potentially significantly large extra room to go away).
Again, if not for performance there is close to no use for resizeable arrays. Any elegant map would do.
compact : 'a t -> unit (sets the storage size to the current element count) would be a reasonable thing to add.
(I don’t understand your remark about physical equality; resizing does not change the physiqual-equality relations in the program?)
Scratch that, I was thinking about unboxed objects (too much C++ these days), which should not be a concern indeed (but maybe for resizeable float arrays).
We also have STL-like vectors in BAP. Here is the documentation and the implementation. The implementation is pretty straightforward in the vain of the Buffer.t module, with only difference that we need to have a default value.
Did this go anywhere ?
I really think not having this in OCaml is embarrassing, dynamic arrays is a feature that is available in virtually every language.
I agree, it’d be nice to have that in the stdlib.
Besides, I think it should live in the stdlib, because I believe it’s impossible to write a good vector implementation without a bit of
Obj to manage the uninitialized part of the array. (I don’t like the requirement to provide a dummy value, it’s a strictly inferior API that is found in no other language (that I know of)).
Arrays in OCaml are already magical; why not dynamic arrays? Even in rust they’re somewhat magical as they concentrate a lot of
unsafe blocks. People modifying the compiler and breaking the (tiny) amount of required
Obj can update the implementation as they go.
FWIW, I agree that an API that doesn’t require providing a dummy value would be very valuable. I wouldn’t mind if initially it was backed by an
'a option array, and maybe later optimized using a solution that’s more costly to maintain.
I still definitely agree that having Dynarrays to work with would be good. Even if it has a little lack of efficiency for now. It could definitely be improved.
I find it odd that with multicore around the corner people are still looking for a mutable data structure. I think that persistent vectors along the lines of RRBs would be a much better tool to add to the box.
I once started an implementation here but never got around to finish it. My motivation was mainly to have a nice data structure to support Unicode text (see Utext) as vector of
Uchar.t values with a good uniform data structure to move across the different segmentation of text you might need: sequences of scalar values, sequences of grapheme clusters, sequences of lines, sequences of paragraphs and any combination thereof by piling the vectors on top of each other.