Adding Dynamic Arrays / Vectors to StdLib

I still definitely agree that having Dynarrays to work with would be good. Even if it has a little lack of efficiency for now. It could definitely be improved.

I find it odd that with multicore around the corner people are still looking for a mutable data structure. I think that persistent vectors along the lines of RRBs would be a much better tool to add to the box.

I once started an implementation here but never got around to finish it. My motivation was mainly to have a nice data structure to support Unicode text (see Utext) as vector of Uchar.t values with a good uniform data structure to move across the different segmentation of text you might need: sequences of scalar values, sequences of grapheme clusters, sequences of lines, sequences of paragraphs and any combination thereof by piling the vectors on top of each other.

6 Likes

I don’t think anyone’s particularly looking for a mutable data structure. What I see is a bunch of algorithms out there assuming extensible arrays. Being able to translate them to ocaml effortlessly and moving on would be the main goal here.

4 Likes

A dummy value could be provided by the module if it is acceptable to use a function instead of simple polymorphism.

And yes, I agree that backing it by an 'a option array is a good idea, as the implementation will take 5 minutes.
It’s possible to sit back an come up with a perfectly optimised implementation after the flawed one is up and running

Mutable data structure are a core part of OCaml IMO
And I think a dynamic array is such a simple and ubiquitous feature that it needs to be in the std lib
Your data structure certainly looks interesting though, thank for the share

1 Like

The standard library is mostly append only. Starting with a bad implementation incurs the risk of getting stuck with the bad implementation for a very long time.

4 Likes

What is the status of this implementation?

If we want dynamic arrays in the stdlib, I think that the work to be done has been clearly outlined: someone needs to do the work of writing a document that:

  • collects pointers to existing resizable-array implementations out there (I believe several have already been mentioned in this thread),
  • compares the APIs and discuss the important differences
  • presents the choices (on the interesting differences) that are proposed for the stdlib
  • (possibly as a second step) proposes an implementation, possibly by reusing code from licence-compatible existing implementations

Before an implementation is written, the authors of this document should feel free to seek feedback on the proposed design, here or in the original PR.

Doing this takes work, but it would also be useful.

4 Likes

Probably a proper place for this then is the RFCs repository first.

1 Like

I expect multicore to make mutable data structures more important, not less.

FWIW, the main problem with the straightforward implementation is that it violates the generational hypothesis so the GC ends up doing ~3x more work than necessary.

Ideally you don’t want the GC traversing all of the empty slots in the latter half of the array either. Achieving that would require extensible arrays to be promoted into the compiler. IMO they are important enough for that to be worthwhile.

5 Likes

One suggestion has been to fill in empty slots with a NULL value which is skipped during marking like an immediate, and this is trivial to implement in the OCaml runtime. In this case, marking is done with a loop with a well-predicted comparison and an increment, which is something that processors run very quickly. The number of uninitialized values is a Θ of the number initialized ones (assuming that shrinking is implemented), so the time spent marking uninitialized slots is negligible. I do not know if this would be a premature optimisation, but anything more involved in terms of runtime support probably is. (Replying to the grand-parent reply, I am also sceptical that the cost of initializing unused array slots is non-negligible.)

IMO adding support for a special “null” non-value has the best cost/benefit ratio to properly implement dynamic arrays.

(Note: discussion has since been happening at https://github.com/ocaml/ocaml/pull/11563)

1 Like

IMO the API is as simple as:

module type DARRAY = sig
  type 'a t

  val make : unit -> 'a t
  val length : 'a t -> int
  val get : 'a t -> int -> 'a
  val set : 'a t -> int -> 'a -> unit
  val append : 'a t -> 'a -> unit
  val to_array : 'a t -> 'a array
end

I’ve used these a lot in other languages and not once wanted removal or shrinking. They are just a polymorphic Buffer. You use them to accumulate an ordered collection and often trim to size upon completion by converting them into an ordinary array. Once I got them I stopped doing the old “accumulate an immutable singly-linked list and finally reverse it” pattern (smell?).

Make does not require an exemplary element of the required type?
Get does not return an option? Then get can raise an exception (uninitialized array element) and should be called get_exn?

I think I never used, neither felt the need for, a resizable array in my entire OCaml programming life.
Either you use a List, or a Map, or a Hashtbl.
And if you need performance, you transform that damn list (whose size is then known) to an array
before doing array-like accesses.
But I like keeping things simple, more brainy people like more complex things…

1 Like

If persistent vectors allow to have better performance (even if only for parallel code), I am all for it.

Make doesn’t need an exemplary element of the required type.

Get cannot return an uninitialized array element. Only a valid element or index out of bounds exception. You could call it get_exn but OCaml doesn’t have an Array.get_exn.

Recent previous attempts to add dynarrays to the stdlib were #9122 by @BastouP411 in November 2019 and #11563 by @c-cube in September 2022. In January 2023 I submitted #11882, a rebased version of @c-cube PR with a boxed implementation to make progress on this issue (the previous PR was stuck on the discussion of how to get a multicore-safe unboxed implementation).

The current candidate PR is looking for reviewers: there has been a lot of progress on the PR, but what hasn’t been done yet is to review the implementation in details for correctness.

Would there be volunteers for an implementation review? Basically one just needs to go to the PR patchset, look at dynarray.ml there, compare the implementation with dynarray.mli for the intended specification, and comment on issues you find / ask questions about things that are unclear.

5 Likes

Posting here not to derail the PR since I think the way it currently is is right and what I’m going to propose is add extra stuff which can come later: my initial reaction upon reading that append_seq a (to_seq a) is an infinite loop was to nerd-snipe myself on a topic that I later saw was discussed at length in the PR under the name of re-entrancy, which is, what happens if you modify the length of a vector while you’re iterating on it. I actually investigated a few other implementations:

  • python’s “lists” have the more “dynamic” behavior; the below is an infinite loop:
    for x in arr:
        arr.append(x)
    
  • JS’s for ... of ... loops on arrays are also dynamic, however the forEach function has an intermediate behavior where it iterates up to the initial length, terminating early if the length was reduced, and ignoring length increases.
    let a  = [1, 2];
    a.forEach(x => a.push(x)); // a is now [1, 2, 1, 2]
    a.forEach(x => a.splice(a.length - 1, 1)) // a is now [1, 2]
    
  • Rust has the most static behavior possible in that if you attempt to modify the vector inside the iteration it will fail at compile time, because the iteration has to borrow the vector
  • In C++ the spec is that the iterators behave like pointers, so that they become invalid if the vector is reallocated, but within a given capacity the behavior of a for (auto it = arr.begin(); it != arr.end(); it++) loop is the dynamic one. From my experiments, you get segfaults if you do not reserve capacity, no effort is made to detect iterator invalidation. Consistent with this specification and the fact that it computes arr.end() only once, the for_each function has an intermediate behavior where it tolerates and ignores length increases as long as the capacity was reserved, and never iterates past the initial length; unlike JS forEach it will fail on length decreases, not gracefully so. Finally, the for (int x: arr) loop is equivalent to for_each; interestingly, if I’m reading it right, it used to be equivalent to the old-form for-loop, so that it had the dynamic behavior, and they deliberately changed it to the more useful intermediate behavior.
    vector<int> arr {1, 2};
    arr.reserve(4); // without this line the next loop produces {1, 2, 1, 0} (!!)
    // arr is {1, 2, 1, 2} after this loop
    for (int x: arr)
        arr.push_back(x);
    // this loop seg-faults
    for (int x: arr)
        arr.erase(arr.end());
    
  • CCVector’s documentation does not specify the behavior. From my experiments CCVector.iter is fully static and does not see either length increases or length decreases (though it does not fail either). I think it was fixed in relation to this PR because when I tried it last week it would seg-fault. Meanwhile CCVector.to_seq is fully dynamic.
    let open CCVector in
    let arr = of_list [1; 2] in
    iter (push arr) arr; (* arr is now [1; 2; 1; 2] *)
    iter (fun x -> ignore (pop arr)) arr (* arr is now [] *)
    

So all in all we have:

  • Fully static (choice (1) in the PR / Rust): fail on length changes
  • Fully dynamic (choice (3) in the PR / to_seq in the PR / for ... in ...: in python / for (... of ...) in JS / for(...; ...; ...) in C++ if you pre-allocate): track length changes, iterate up to current length
  • Intermediate choices:
    • Ignore length changes and just iterate on the initial array up to the initial length, at the risk of seeing an obsolete array within the iterator (choice (2) in the PR / CCVector.iter / C++ for_each and for (... : ...))
    • Track the length and iterate up to the initial length or the current length, whichever comes first (JS forEach)

I tend to think the more “static” behavior the PR ended up with for the iterating functions (just fail if you detect the length has changed) is a good one since the other choices are footguns. The one thing that is unfortunate is that with to_seq being dynamic, there is no implementation of any intermediate behaviors, which would be useful in cases where you want to modify the array while iterating on it without having to use indices and writing bound checks. I wonder if one of the following functions would not be helpful:

(* like CCVector.iter *)
let to_seq_slice_1 ?(start = 0) ?end_ { length; arr } =
  let end_ = match end_ with Some n -> n | None -> length - 1 in
  let rec loop i () =
    if i > end_ then Seq.Nil
    else Seq.Cons (arr.(i), loop (i + 1)) (* you can probably use unsafe_get here *)
    in
  loop start
(* like JS forEach *)
let to_seq_slice_2 ?(start = 0) ?end_ dyn_arr =
  let initial_end = match end_ with Some n -> n | None -> dyn_arr.length - 1 in
  let rec loop i () =
    if i > initial_end || i >= dyn_arr.length then Seq.Nil
    else Seq.Cons (dyn_arr.arr.(i), loop (i + 1))
    in
  loop start

Then you could append the vector to itself with the following code, without copying the array nor writing any bound checks:

append_seq arr (to_seq_slice arr)

Edit: the more I think of it, the more I’m convinced this is too clever actually. Just use indices or get a copy of the backup if you really need this behavior.

2 Likes