Standardized mutable data structures

I like OCaml because it is a pragmatic language that encourages a functional style, but doesn’t disallow mutability, which is necessary to implement many algorithms and data structures. However, OCaml’s standard library is centered around functional programming (for example, it has Lisp/ML-style cons linked lists, but not mutable doubly linked lists). I don’t think this is necessarily a bad thing, but sometimes, when I need to reach for a mutable data structure, I’m not sure where to go.

Recently, I needed a doubly linked list that specifically allowed O(1) insertion and deletion given a reference to a node. I found lwt-dllist, which supported removing nodes but not inserting a new node in front of or behind a given node. I also found Batteries’ BatDllist, which was a circular linked list and didn’t support empty lists; it seems to treat nodes as the list itself, with no “owner” of the nodes. I ended up rolling my own doubly linked list.
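For reference, the core of what I ended up writing looks roughly like this. This is a minimal sketch with names of my own choosing, not code from any of the libraries above:

```ocaml
(* A mutable doubly linked list supporting O(1) insertion and removal
   given a reference to a node. *)
type 'a node = {
  value : 'a;
  mutable prev : 'a node option;
  mutable next : 'a node option;
}

type 'a t = {
  mutable first : 'a node option;
  mutable last : 'a node option;
}

let create () = { first = None; last = None }

let push_front t v =
  let m = { value = v; prev = None; next = t.first } in
  (match t.first with
   | Some old -> old.prev <- Some m
   | None -> t.last <- Some m);
  t.first <- Some m;
  m

(* O(1): splice a fresh node holding [v] in right after [n]. *)
let insert_after t n v =
  let m = { value = v; prev = Some n; next = n.next } in
  (match n.next with
   | Some s -> s.prev <- Some m
   | None -> t.last <- Some m);
  n.next <- Some m;
  m

(* O(1): unlink [n], given only the node itself and its list. *)
let remove t n =
  (match n.prev with
   | Some p -> p.next <- n.next
   | None -> t.first <- n.next);
  (match n.next with
   | Some s -> s.prev <- n.prev
   | None -> t.last <- n.prev);
  n.prev <- None;
  n.next <- None

let to_list t =
  let rec go acc = function
    | None -> List.rev acc
    | Some n -> go (n.value :: acc) n.next
  in
  go [] t.first
```

It’s not a lot of code, which is part of my point: it’s short enough that everyone keeps rewriting it.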

Also consider dynamic arrays. bheap rolls its own dynamic array inside its repository. Meanwhile, there are standalone dynamic array packages: vector, which requires the user to provide a dummy value; vec, which uses Obj.magic; and Batteries’ BatDynArray, which also uses Obj.magic. Each of these libraries has its own pros and cons, and additionally, if I needed to use both bheap and vec, I would have two implementations of dynamic arrays in my program, one of them internal to bheap.

If I just need a single data structure such as a doubly linked list or a dynamic array, I would also feel reluctant to pull in an entire standard library “extension” such as Batteries or Containers.

I wonder if other people have run into needing a common mutable data structure that wasn’t in the standard library and didn’t have one library that the majority of the ecosystem “agreed” upon. Did you have to choose between several libraries? Did you ultimately roll your own? How would people feel if the standard library included more common imperative data structures?

(As a side note, if people ever agree on one dynamic array library, I hope people don’t call it “vector.” The designer of C++’s vector chose the name because it resembled “vectors” in Lisp, but this was a design mistake and a distortion of the mathematical notion of a vector; see the Stack Overflow question “Why is a C++ Vector called a Vector?”)


Two thoughts:

  1. haha, too late, it’s vector in OCaml, and vec in Rust. Soo late, soooo late. But yeah, I agree with you.

  2. Surely there must be a way to have a library of functions/classes where you only link the bits you actually use? In C/C++, that’s what “.a” files give you. It seems like OCaml’s “.cma” doesn’t give you that. Maybe I’ve misunderstood it: I never actually inquired into what the format supported, whereas with “.a” files, it was so plainly obvious …

According to the Containers docs, you can use individual modules without depending on the entire library by just copying the modules into your project.
Here is its vector module, for example.


It could be something else for OCaml, there’s nothing standard yet.
There might be a Dyn_array at some point :wink:

When you link a .cma (or .cmxa) the OCaml linker only extracts the units you actually need. So you can have a big library of independent modules, and if you only use one module your binary will only include the code for that module. But unlike C, the OCaml linker can’t separate individual parts of a compilation unit, so if you use a module you will get all the code of that module even if you only use one function.


What are you referring to? Are you saying that OCaml might add a Dyn_array module? (I couldn’t find a relevant PR.) Also, with OCaml 5 around the corner, do you think the OCaml team would be too busy to deal with stdlib PRs right now?

Or are you referring to renaming CVector in containers?

Ah. Then I guess the reason you get this effect of “include Base/Core and your program bloats ginormously” is that they’re packing modules together (using for-pack) before putting them into the “.cma”?

Neither Core nor Base has used for-pack since OCaml 4.02, precisely for that reason: Jane Street Tech Blog - Better namespaces through module aliases. In other words, program size bloat with Core has not been an issue for eight years (and Base was released after the switch to module aliases, so it never had the issue).


Heh, I’ll have to give it a try again, then. The reason I switched away from Core and Core_kernel was specifically the bloated size of executables. Thank you for pointing this out!

If the behaviour is more important and the implementation is not set in stone, then perhaps you can use a Hashtbl.t? It’s in the standard library, is mutable, and should have effectively O(1) insertion and removal.

Should also work for dynamic arrays e.g. type 'a dyn_array = (int, 'a) Hashtbl.t. If you want to prevent sparse arrays you can even create a custom wrapper module that enforces the invariant.
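A minimal sketch of that wrapper idea, using only the stdlib. The module and function names here are invented for illustration; the wrapper enforces a dense 0..len-1 index range:

```ocaml
(* A "dynamic array" backed by the stdlib Hashtbl, with a wrapper
   module enforcing that indices stay dense (no sparse arrays). *)
module Dyn : sig
  type 'a t
  val create : unit -> 'a t
  val length : 'a t -> int
  val push : 'a t -> 'a -> unit
  val get : 'a t -> int -> 'a
  val set : 'a t -> int -> 'a -> unit
end = struct
  type 'a t = { tbl : (int, 'a) Hashtbl.t; mutable len : int }

  let create () = { tbl = Hashtbl.create 16; len = 0 }
  let length a = a.len

  (* Append at index [len], then bump the length. *)
  let push a v =
    Hashtbl.replace a.tbl a.len v;
    a.len <- a.len + 1

  let get a i =
    if i < 0 || i >= a.len then invalid_arg "Dyn.get";
    Hashtbl.find a.tbl i

  let set a i v =
    if i < 0 || i >= a.len then invalid_arg "Dyn.set";
    Hashtbl.replace a.tbl i v
end
```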

I’m not a fan of this. Hashtbls are effectively O(1), but that depends on the elements not all landing in the same bucket, a big asterisk. I would also need to keep a mutable counter for generating fresh IDs, possibly “garbage collect” unreferenced elements, and more. Using a Hashtbl as a linked list counts as “rolling my own linked list,” except it’s even clunkier and more effort.

This reminds me of when Rust people store graphs in a big vector indexed by ID (or use some other workaround) because of unclear ownership, or when Haskell people use an int map because of purity. I am using OCaml because its garbage collection and imperative features let me express common mutable data structures such as linked lists. I agree with the sentiment about programming languages and the ability to implement data structures expressed in this discussion: inf-groupoid’s comments on “Typing imperative algorithms precisely, especially graph algorithms.”


Based on what @vlaviron and @octachron wrote, I’m wondering: since there’s not really a size problem with pulling in Batteries/etc., why not just do it? My original reason was solely about size, but if that’s not an issue, what is?

Just curious. To be clear, I haven’t used Core in ages b/c of size, so if that problem is solved … it’ll be a good day. I did like some of the things I got from Core, and found copying code for those few cases to be laborious and a PITA overall. Be nice not to have to do that anymore.

The following issues are not resolved by the proposed solutions:

  • Code duplication: If I either roll my own linked list or vendor a module, and another library does the same, then the final program has two redundant implementations, both of which are used and cannot be eliminated. Even if size is not a problem in practice, this is ugly IMO.
  • Interfaces: If the data structure becomes part of the API, then clients are forced to buy into the same data structure library. This could result in needing to convert between two implementations of the same data structure because two different libraries use different ones. In practice, this leads to one set of packages buying into one ecosystem and another set of packages buying into another, such as Lwt vs Async. Imagine how bad things would be if OCaml didn’t have a cons list type, so multiple libraries defined their own!

As a side note, I wonder if OCaml could have a promises signature not tied to a specific runtime, similar to what Rust does, so that Lwt and Async become “runtimes” that the user only chooses when the async code needs to be run to produce a final result?

OK, so before I ask my question, I’ll note again that I avoided using these big libraries, b/c of size/bloat problems, and also (which I didn’t say before) because I found their interfaces to be … substandard. So, that said, it sounds like:

your real problem with these various extended-library implementations of utility data structures is that

  1. they’re either not adequate/adapted for your needs

  2. or that there’s more than one, and that leads to confusion

#1 I agree with, and … well, there’s not a lot one can do other than try to improve things. But #2 … well, isn’t the solution there to just pick one, and hope others will do the same? I mean, it’s not exactly possible to force everybody else to use the one you use. Your example of Rust is instructive: there are a ton of libraries outside of the core, and hence lots of possibility for what one might call needless and confusing duplication. But somehow the ecosystem keeps going and improving.

I do take your point, that it would be good if implementors of variants of some common concept (e.g. promises) would agree upon standard interfaces, so they could pursue their different implementations without imposing extra burden on users. But even for your example, my memory is that Lwt & Async are at some level very different? That Async has provision for errors built in, where Lwt doesn’t? I’ve forgotten – there was a nice post comparing the two a while back – so maybe this part I’m just having a brain-fart.

ETA: for your second point (that if we expose, e.g., Containers.Deque in the interface of our library, we force clients to use that abstraction), I don’t see how it’s any better if I instead write my own Deque. Then I’m just adding to the diversity of implementations, aren’t I? And it’s actually worse: whereas before, say there was Core.Deque and Containers.Deque, and I used one while a consuming library preferred the other, now if I do my own MyDeque, that consuming library has it even worse, because why would they even know about MyDeque?

Again, I’m not saying that this is great. And as I wrote in my preface, I’ve avoided these large libraries in the past for reasons somewhat connected with yours. But … well, if size isn’t an issue, it seems like we’re down to “we need to convince the authors to standardize more”. And that’s painful work, but it seems to be the same sort of work, regardless of which programming language one chooses.

Hopefully with effect handlers such fragmentation can be avoided if there is an interface that is simple enough for multiple implementations to agree on (and that interface ends up somewhere common, like the stdlib).
Although that isn’t the case yet, there was an interesting paper/talk presented about this in the recent ICFP 2022 OCaml workshop: Composing Schedulers using Effect Handlers (OCaml 2022 - OCaml Users and Developers Workshop 2022) - ICFP 2022

Another interesting approach for compatibility (that works only in limited cases) is to use polymorphic variants. E.g. there is a library providing JSON schemas, ocplib-json-typed, that works with both yojson and ezjsonm; it does so by defining its own polymorphic variant that is exactly compatible with ezjsonm’s but without depending on it explicitly (this polymorphic variant can then be passed between the two libraries because their types are compatible).

I don’t think either of these would work easily for generic data structures, which may need to work with regular variants instead of polymorphic ones. Although a small portion of a data structure’s API is sort of standardized: iterators, and to/from Seq.t. As long as a data structure provides both of those, you can at least convert between two data structures reasonably efficiently without either being aware of the other (i.e. they could be implemented in two different libraries).
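For example, with nothing but the stdlib, a Stack and a Queue that know nothing about each other can be converted through Seq.t (to_seq/of_seq have existed on most stdlib containers since OCaml 4.07):

```ocaml
(* Move elements from a Stack into a Queue via Seq.t; neither
   structure needs to be aware of the other. *)
let s = Stack.create ()

let () =
  Stack.push 1 s;
  Stack.push 2 s;
  Stack.push 3 s

(* Stack.to_seq yields elements top to bottom, so the top of the
   stack becomes the front of the queue. *)
let q : int Queue.t = Queue.of_seq (Stack.to_seq s)
```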

Early on I was faced with a paradox of choice quite often, but usually it is better to just pick one implementation (through various criteria: e.g. which one has better documentation, which one is actively maintained, which one has more users, which one has better performance/security characteristics, which one has fewer dependencies, etc.).
If you’re worried you are going to pick wrong and might have to switch in the future then a lightweight wrapper in your project might be useful so you can change your mind later on more easily.

This is more difficult with projects like Lwt vs Async, I agree, and for quite a long time I tried to make my code work with both by restricting myself to a sort of wrapper common to both (they both look like monads, right?). However, that just means I got the lowest common denominator: there was always some feature that was more easily done in one library, and using a common interface just meant I’d either have to reimplement those features or not use them (e.g. concurrent iteration, or throttling concurrency, etc.).
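The shape of such a wrapper, as I understand the approach, is something like the following sketch. The module names are mine; the Lwt line is commented out because it assumes lwt is installed:

```ocaml
(* The "lowest common denominator" signature: just the monadic core.
   Anything richer (cancellation, throttling, ...) falls outside it. *)
module type FUTURE = sig
  type 'a t
  val return : 'a -> 'a t
  val bind : 'a t -> ('a -> 'b t) -> 'b t
end

(* Lwt matches this signature directly:
     module Lwt_future : FUTURE = Lwt
   Async would need a small adapter around Deferred. *)

(* A trivial identity instance, just to show that code written
   against FUTURE runs under any implementation. *)
module Id_future : FUTURE with type 'a t = 'a = struct
  type 'a t = 'a
  let return x = x
  let bind x f = f x
end

module Make (F : FUTURE) = struct
  (* Code here is stuck with the common subset of both libraries. *)
  let double x = F.bind (F.return x) (fun v -> F.return (v * 2))
end

module M = Make (Id_future)
```

The functor body is exactly where the lowest-common-denominator problem bites: nothing specific to either runtime can appear there.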


For point #2, the solution I’m getting at isn’t to roll my own data structure, it’s for the standard library to have an implementation of the data structure as a “blessed” version that everyone agrees upon.

I agree with you; I’m mostly just noting that to get there, you’d have to convince various groups to agree on what that implementation would be. And that can happen independently of a standard library.

Also … well, I saw how it went in Java, where lots of stuff got inhaled into the standard JDK: it became an intensely political process, and frankly it didn’t produce such great results. The fact that there was one “blessed” solution for many problems meant that others couldn’t do better (or more precisely, even if they did better, they couldn’t get anybody to use their stuff), and that was bad all-around. It became a real roadblock for innovation.


I don’t think I’ve ever found this to be a problem. In general, I tend to wrap such a data structure in an application-specific interface; that way I can change the underlying structure to my needs without impacting users of the API. I might provide tools in the API to let someone convert it to a structure of their choosing. But at the same time: unless you think what you’re writing is going to become quite popular, don’t worry about it; you’ll probably be the only user for a long time.

This is how I structured my own concurrency library: I have interfaces for the whole thing and then different implementations.

The problem is, concurrency libraries are different because they are different. Mine, for example, supports cancellation; Lwt and Async don’t really support that. And Lwt and Async support things that are incompatible between them. I already run into incompatibility even in my small world: because my futures library is an interface, I easily implemented it in JavaScript with a few helper functions. Cancellation works as expected for all work inside my library, but I sometimes need to interface with JS libraries, and the promises they return may or may not support cancellation. I “solve” this with documentation where needed, but obviously in a complex application, you won’t realize it’s an issue until you run into it.


Sure, but if the alternative is a linked list, isn’t ‘effectively O(1)’ strictly better than a linked list’s O(n)?

I also need to keep a mutable counter for generating fresh IDs

Sure, agreed. There are several options for this, including refs and a mutable record field.
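Concretely, either of these works with just the stdlib (names are illustrative):

```ocaml
(* A fresh-ID counter as a global ref... *)
let next_id = ref 0
let fresh () = incr next_id; !next_id

(* ...or as a mutable record field, if you want several independent
   generators. *)
type gen = { mutable n : int }
let fresh' g = g.n <- g.n + 1; g.n
```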

possibly “garbage collecting” unreferenced elements

This is exactly what Ephemeron gives you: weak hash tables which automatically garbage-collect unreferenced values.

Using a Hashtbl as a linked list counts as “rolling my own linked list,” except it’s even clunkier and more effort.

But on the upside, it’s rolling it with nothing more than what’s already in the standard library, no extra libraries needed.

I am using OCaml because its garbage collection and imperative features allow me to express common mutable data structures such as linked lists

Sure, but is a linked list the best data structure for this job? Your original post just said you ‘needed’ a linked list, but not why it specifically had to be a linked list.

Hash tables are ideally O(1) to retrieve elements, while linked lists are O(n). However, doubly linked lists offer O(1) insertion and removal given a pointer to a node. A hash table could have similar characteristics if it used a doubly linked list internally and exposed a pointer to the inner bucket node to clients, but that presupposes having a doubly linked list available in the first place. There are also other crucial differences between a hash table and a linked list, such as the ordering of elements.

Using a hash table to achieve the functionality and performance characteristics of a doubly linked list is at least as much effort as implementing an actual doubly linked list myself. The fact that the hash table comes from the standard library doesn’t change that.

Allow me to be clearer, then. I was implementing an algorithm (lexicographic breadth-first graph traversal) that runs in O(n), provided that I use doubly linked lists to implement ordered partitions of a set so that splitting partitions is O(1). OCaml’s Hashtbl module does not let me easily maintain the ordering or get genuine O(1) splitting. Specific algorithms operate on specific data structures, and the performance characteristics of these algorithms depend on using the actual data structure that is called for, not a different data structure as a substitute.

A doubly linked list is a mainstream data structure, and I think it’s reasonable to use it and want it in the standard library. I don’t consider a hash table to be a solution.