Adding Dynamic Arrays / Vectors to StdLib

IMO the API is as simple as:

module type DARRAY = sig
  type 'a t

  val make : unit -> 'a t
  val length : 'a t -> int
  val get : 'a t -> int -> 'a
  val set : 'a t -> int -> 'a -> unit
  val append : 'a t -> 'a -> unit
  val to_array : 'a t -> 'a array
end

I’ve used these a lot in other languages and not once wanted removal or shrinking. They are just a polymorphic Buffer. You use them to accumulate an ordered collection and often trim to size upon completion by converting them into an ordinary array. Once I got them I stopped doing the old “accumulate an immutable singly-linked list and finally reverse it” pattern (smell?).

Make does not require an exemplary element of the required type?
Get does not return an option? Then get can raise an exception (uninitialized array element) and should be called get_exn?

I think I never used, neither felt the need for, a resizable array in my entire OCaml programming life.
Either you use a List, or a Map, or a Hashtbl.
And if you need performance, you transform that damn list (whose size is then known) to an array
before doing array-like accesses.
But I like keeping things simple, more brainy people like more complex things…

1 Like

If persistent vectors allow to have better performance (even if only for parallel code), I am all for it.

Make doesn’t need an exemplary element of the required type.

Get cannot return an uninitialized array element. Only a valid element or index out of bounds exception. You could call it get_exn but OCaml doesn’t have an Array.get_exn.

Recent previous attempts to add dynarrays to the stdlib were #9122 by @BastouP411 in November 2019 and #11563 by @c-cube in September 2022. In January 2023 I submitted #11882, a rebased version of @c-cube PR with a boxed implementation to make progress on this issue (the previous PR was stuck on the discussion of how to get a multicore-safe unboxed implementation).

The current candidate PR is looking for reviewers: there has been a lot of progress on the PR, but what hasn’t been done yet is to review the implementation in details for correctness.

Would there be volunteers for an implementation review? Basically one just needs to go to the PR patchset, look at dynarray.ml there, compare the implementation with dynarray.mli for the intended specification, and comment on issues you find / ask questions about things that are unclear.

4 Likes

Posting here not to derail the PR since I think the way it currently is is right and what I’m going to propose is add extra stuff which can come later: my initial reaction upon reading that append_seq a (to_seq a) is an infinite loop was to nerd-snipe myself on a topic that I later saw was discussed at length in the PR under the name of re-entrancy, which is, what happens if you modify the length of a vector while you’re iterating on it. I actually investigated a few other implementations:

  • python’s “lists” have the more “dynamic” behavior; the below is an infinite loop:
    for x in arr:
        arr.append(x)
    
  • JS’s for ... of ... loops on arrays are also dynamic, however the forEach function has an intermediate behavior where it iterates up to the initial length, terminating early if the length was reduced, and ignoring length increases.
    let a  = [1, 2];
    a.forEach(x => a.push(x)); // a is now [1, 2, 1, 2]
    a.forEach(x => a.splice(a.length - 1, 1)) // a is now [1, 2]
    
  • Rust has the most static behavior possible in that if you attempt to modify the vector inside the iteration it will fail at compile time, because the iteration has to borrow the vector
  • In C++ the spec is that the iterators behave like pointers, so that they become invalid if the vector is reallocated, but within a given capacity the behavior of a for (auto it = arr.begin(); it != arr.end(); it++) loop is the dynamic one. From my experiments, you get segfaults if you do not reserve capacity, no effort is made to detect iterator invalidation. Consistent with this specification and the fact that it computes arr.end() only once, the for_each function has an intermediate behavior where it tolerates and ignores length increases as long as the capacity was reserved, and never iterates past the initial length; unlike JS forEach it will fail on length decreases, not gracefully so. Finally, the for (int x: arr) loop is equivalent to for_each; interestingly, if I’m reading it right, it used to be equivalent to the old-form for-loop, so that it had the dynamic behavior, and they deliberately changed it to the more useful intermediate behavior.
    vector<int> arr {1, 2};
    arr.reserve(4); // without this line the next loop produces {1, 2, 1, 0} (!!)
    // arr is {1, 2, 1, 2} after this loop
    for (int x: arr)
        arr.push_back(x);
    // this loop seg-faults
    for (int x: arr)
        arr.erase(arr.end());
    
  • CCVector’s documentation does not specify the behavior. From my experiments CCVector.iter is fully static and does not see either length increases or length decreases (though it does not fail either). I think it was fixed in relation to this PR because when I tried it last week it would seg-fault. Meanwhile CCVector.to_seq is fully dynamic.
    let open CCVector in
    let arr = of_list [1; 2] in
    iter (push arr) arr; (* arr is now [1; 2; 1; 2] *)
    iter (fun x -> ignore (pop arr)) arr (* arr is now [] *)
    

So all in all we have:

  • Fully static (choice (1) in the PR / Rust): fail on length changes
  • Fully dynamic (choice (3) in the PR / to_seq in the PR / for ... in ...: in python / for (... of ...) in JS / for(...; ...; ...) in C++ if you pre-allocate): track length changes, iterate up to current length
  • Intermediate choices:
    • Ignore length changes and just iterate on the initial array up to the initial length, at the risk of seeing an obsolete array within the iterator (choice (2) in the PR / CCVector.iter / C++ for_each and for (... : ...))
    • Track the length and iterate up to the initial length or the current length, whichever comes first (JS forEach)

I tend to think the more “static” behavior the PR ended up with for the iterating functions (just fail if you detect the length has changed) is a good one since the other choices are footguns. The one thing that is unfortunate is that with to_seq being dynamic, there is no implementation of any intermediate behaviors, which would be useful in cases where you want to modify the array while iterating on it without having to use indices and writing bound checks. I wonder if one of the following functions would not be helpful:

(* like CCVector.iter *)
let to_seq_slice_1 ?(start = 0) ?end_ { length; arr } =
  let end_ = match end_ with Some n -> n | None -> length - 1 in
  let rec loop i () =
    if i > end_ then Seq.Nil
    else Seq.Cons (arr.(i), loop (i + 1)) (* you can probably use unsafe_get here *)
    in
  loop start
(* like JS forEach *)
let to_seq_slice_2 ?(start = 0) ?end_ dyn_arr =
  let initial_end = match end_ with Some n -> n | None -> dyn_arr.length - 1 in
  let rec loop i () =
    if i > initial_end || i >= dyn_arr.length then Seq.Nil
    else Seq.Cons (dyn_arr.arr.(i), loop (i + 1))
    in
  loop start

Then you could append the vector to itself with the following code, without copying the array nor writing any bound checks:

append_seq arr (to_seq_slice arr)

Edit: the more I think of it, the more I’m convinced this is too clever actually. Just use indices or get a copy of the backup if you really need this behavior.

2 Likes

As someone who was also nerd-sniped by this question, I recognize the crave and I thank you for the comparison with other programming languages, many of those I had not checked and the comparison is interesting.

I think that you can actually do something related to this to_seq_slice construction you have in mind with Dynarray.to_seq a |> Seq.take (Dynarray.length a). This sequence will stop early if elements are deleted, but not grow if elements are added. (One can of course use Dynarray.to_array a for this, but I supposed you wanted to avoid the intermediate copy.)

1 Like

This is a good topic, in fact I am not convinced by the difference between to_seq and other iterators.

val to_seq : 'a t -> 'a Seq.t
(** [to_seq a] is the sequence of elements
    [get a 0], [get a 1]... [get a (length a - 1)].

    Because sequences are computed on-demand, we have to assume that
    the array may be modified in the meantime, and give a precise
    specification for this case below.

Especially because the data flow is less obvious with sequences, I would expect an exception (=programming error) if the structure of the dynarray changes during iteration.

I could consider providing both versions if we can find good names for both.

My idea is more about discouraging the footgunny one than letting users choose. What usages do you have in mind for the current specification?

I fixed the CCVector.to_seq behavior to match to_iter. It’s easier and cleaner to capture the current length and array initially anyway.

Ah, smart. I always forget that defining a non-terminating sequence is not necessarily a big deal, you don’t have to read it all.

Ok that makes sense, I did some tests a few days back with some old version, and again today with the latest version, but I didn’t double check to_seq, only iter.

I am not sure where you are reading that, but the behavior has always been static, even in the original proposal from 2005: “Notice that the end iterator is only calculated once.” (Proposal for new for-loop)

That is just a lucky accident of existing implementations. The C++ standard is quite clear that this is undefined behavior: “If no reallocation happens, then references, pointers, and iterators before the insertion point remain valid but those at or after the insertion point, including the past-the-end iterator, are invalidated.” You cannot keep using the original value of arr.end(), even if the capacity was reserved.

Hi everyone,

A short note: we eventually manage to build consensus on https://github.com/ocaml/ocaml/pull/11882 and I merged it yesterday, so there will be a Dynarray module in the standard library in OCaml 5.2. Thanks @BastouP411 and @c-cube for their efforts bringing us to this point, and to the many reviewers of the latest iteration of the PR (in particular @314eter, @c-cube, @clef-men, @damiendoligez, @dbuenzli, @gadmm, @Octachron and @wiktor).

The implementation uses an indirection to get a type-safe “empty” value to use in unused elements of the support array. Hopefully we can improve on that with a safe magical implementation in the short future – but I think that the performance is good enough already.

24 Likes

Just want to say thanks everyone for the hard work on getting this merged :tada:. Sometimes the “obvious” improvements to a language are the hardest to get consensus on, but I’m glad this got done!

1 Like

Hopefully we can improve on that with a safe magical implementation in the short future – but I think that the performance is good enough already.

I remember the second half being discussed in the thread and I do not think that this conclusion held up. In particular it is very difficult to have an experiment that is relevant wrt cache-locality effects using micro-benchmarks.

2 Likes

We did discuss this, and I ran measurements on the effect of boxing/indirection on a real-world dynarray-intensive program: https://github.com/ocaml/ocaml/pull/11882#issuecomment-1405867624 .

5 Likes

This was a good experiment. Thank you for your prolonged efforts and for pushing this to the finish line!

6 Likes

Excited to experiment with it! Thank you all for your hard work :slight_smile:

I do enjoy the irony that one of the main reason the very first proposal (the one that led to the creation of this discuss thread) was rejected was because of the indirection. (I know that the new decision makes sense for different reasons (OCaml 5 etc), I just find it funny)