[ANN] First release of baby

It is my pleasure to announce the first release of baby.

baby is an OCaml library that offers several implementations of balanced binary search trees. At this time, baby offers a replacement for OCaml’s Set module; it does not yet have a replacement for OCaml’s Map module.

Height-balanced and weight-balanced binary search trees are offered out of the box. Furthermore, to advanced users, the library offers a lightweight way of implementing other balancing strategies.

The following points offer a comparison between baby and OCaml’s Set library.

Better Performance

At the time of writing, baby offers generally better performance than OCaml’s Set library. Its operations are generally faster (sometimes much faster; sometimes slightly faster; sometimes slightly slower) than those of the Set library, and its memory allocation rate is slightly lower.

Constant-Time Cardinal

In contrast with the Set library, baby’s weight-balanced trees offer a cardinal function whose time complexity is O(1). They also offer a family of random access functions (get, index, etc.) whose time complexity is O(log n). Furthermore, by exploiting cardinality information, the functions subset and equal are sometimes able to return false in constant time.

Better Sharing

baby’s binary operations (union, inter, diff) take advantage of (and preserve) physical equality in a more aggressive way. This allows them to (sometimes) be faster and allocate less memory.

Adaptive Conversions To Sets

baby’s conversion functions of_list, of_array, and of_seq have adaptive complexity. If the input data is sorted, their complexity is O(n); otherwise, their complexity gracefully degrades down to O(n.log n).

More Operations

baby offers a few operations that do not exist in OCaml’s Set library:

  • The symmetric difference, xor;
  • The conversion functions of_array and to_array;
  • The extremum-removal functions remove_min_elt and remove_max_elt;
  • The enumeration API in the submodule Enum. Enumerations should be slightly faster than standard sequences, and are able to efficiently seek ahead, via the function from.

Documented Complexity

In baby, the time complexity of every operation is documented.

Compatibility

baby is perfectly compatible with OCaml’s Set library. In other words, using Baby.W.Set instead of Set is safe.

As a word of warning, though, if the equivalence relation on elements is coarser than equality (that is, if compare x y = 0 does not imply x = y), then Baby.W.Set and Set might behave differently when a choice must be made between two equivalent elements. This can occur in union, of_list, of_array, of_seq, add_seq, map.

23 Likes

I was confused by this sentence. After thinking about it, I think it that x = y must be understood as mathematical equality, not the OCaml polymorphic structural equality operator. With this interpretation, it makes sense, and the warning is that Set and Baby.W.Set may not pick the same element when adding several elements that compare to 0. Maybe using x == y instead of x = y in the warning would clarify the meaning ?

I also saw this comment at the start of Baby.cppo.ml:

(* Unfortunately, the OCaml compiler is pretty bad at optimization. In my
   experience, although it does usually inline functions when requested, it
   does not subsequently perform the simplifications that one might naturally
   expect. In particular, it does not simplify match-of-match, and cannot even
   simplify match-of-constructor. *)

Have you compared the results with and without flambda ? At least simplification after inlining should work with flambda, as well as match-of-constructor. Hopefully flambda2 will also support match-of-match soon, but I don’t think any of the current compilers do.

1 Like

Hello Vincent – I just came across your reply.

Thanks for pointing out my ambiguous use of the equality symbol in the documentation. I have improved the sentence (I think) by writing: if the equivalence relation on elements is coarser than equality (that is, if compare x y = 0 does not imply that x and y are indistinguishable), then Baby.W.Set and Set might behave differently.

Regarding flambda, yes, I have tried it, and it is generally better than the mainstream compiler, I think. But flambda is not the default compiler, so I am still taking steps to ensure that I get optimal performance even with the default compiler. (This can be ugly… I have been using some macros.)

But where is it? I was expecting a link to the implementation.

Also at baby 20240619 (latest) · OCaml Package

Was it motivated by this discussion about the performance of official containers? :sweat_smile:

Actually, no, it wasn’t; I was not aware of this discussion. Thanks for pointing me to it!

1 Like

Just use opam install baby. And the documentation is online.

Does this support elements where they are ordered by some part of the element but an element can hold additional information that can be retrieved knowing only the ordered part? Essentially a key/value store with set operations? The OCaml Set has a find operation for this purpose but it is a bit awkward.

You are correct, find in OCaml’s Set library feels a bit awkward, because it is the only operation that is meant to be used in a setting where compare x y = 0 does not imply that x and y are equal. The description of every other operation in the library assumes that compare x y = 0 implies that x and y are indistinguishable. For example, the documentation of union does not document what happens if one set contains x, the other set contains y, and x and y are equivalent but not equal.

baby implements the Set API, so it inherits this weirdness.

In the future, I would like baby to offer both the Set API and the Map API, without internal code duplication (!), so I will perhaps be forced to think more closely about this issue.

3 Likes

Here is a general remark which applies both to baby and Stdlib.Set. Consider the trivial comparison:

module Triv = struct type t = int let compare _ _ = 0 end

and we define the “same” set with

module B = W.Set.Make(Triv);;
let s = B.of_list [1;2;3;4;5];;
let u = B.of_list [5;4;3;2;1];;

We have

# B.equal s u;;
- : bool = true

which is expected, however

# B.to_list s;;
- : int list = [1]
# B.to_list u;;
- : int list = [5]

(the same behaviour is obtained with Stdlib) Two remarks

  1. We see that the “same” set can lead to different to_list results, this is “natural” but maybe a word of warning in the documentation of to_list could be welcome.
  2. Quoting from Stdlib: " of_list bs adds the bindings of bs to the empty map, in list order (if a key is bound twice in bs the last one takes over)." Isn’t it the opposite? From the examples I would say “the first one takes over”

We see that the “same” set can lead to different to_list results, this is “natural” but maybe a word of warning in the documentation of to_list could be welcome.

They’re the same list though:

# List.equal (fun x y -> Triv.compare x y = 0) (B.to_list s) (B.to_list u);;
- : bool = true
1 Like

you don’t specify a compare function when you construct a list, so the name of the function List.equal is misleading in my opinion.

Anyway, this was not really my point, I’m just saying that a word of warning is important here

You’re quoting the documentation of Map but comparing it with the implementation in Set. The Set documentation does not specify which one gets picked in case of collision, and the Map version correctly implements its documentation as far as I can tell.

oops, sorry for the noise then, my bad

Hello! You are right to point out that the behavior of these libraries can be confusing/surprising when compare is a preorder. The documentation of baby explicitly indicates that only the case where compare is a total order is documented.

In the following, we usually document the behavior of each operation only in the common case where the relation ≤ is a total order.

Outside of this case, it is up to the reader to guess/imagine what will happen. It was just too painful to write the documentation of the general case…

I don’t know if there is any practical application but from the mathematical viewpoint it is tempting to use Sets with non-total orders to construct equivalence classes. For instance a (stupid?) implementation of Z/5Z would be

# module Modulo5 = struct type t = int let compare x y = compare (x mod 5) (y mod 5) end;;
# module M5 = Set.Make(Modulo5);;
# let x = M5.of_list [0;5;1;8];;
# M5.cardinal x;;
- : int = 3
# M5.to_list x;;
- : int list = [0; 1; 8]
# M5.mem 3 x;;
- : bool = true

I didn’t search, but maybe this has already been used for more complicated equivalence relations?

1 Like

It would be nice if the documentation for union and similar mentioned which element is taken in case of equivalent but not indistinguishable keys. (If you don’t want to promise anything specific it would still be helpful to have that written down explicitly)