[ANN] Release 1.0.0 of bag

backtracking · March 13, 2021, 12:21pm

Clément and myself are pleased to announce the first release of bag, a library of multisets.

SpiceGuid · March 13, 2021, 8:26pm

Is a multiset an inventory ? If no then what’s the difference ? If yes then i can imagine some inventory operation(s) that bag 1.0.0 does not offer.

backtracking · March 14, 2021, 8:21am

A multiset is nothing more than a set where elements can appear multiple times. See https://en.wikipedia.org/wiki/Multiset

I don’t think this is the same concept as an inventory (though I have to admit I’m not 100% sure I understand what an inventory is).

SpiceGuid · March 14, 2021, 2:44pm

Let’s suppose i have building instructions for a big lego model. How many bricks do i miss ? What official lego sets should i buy to recover most of them ? What order to BrickLink.com should i place to complete the picture ? How much bag 1.0.0 helps me to solve this multiset problem ?

Here are two additional multiset operations i suggest (from my empirical & programming experience in lego-building) :

val mul : t -> int -> t
Multiset multiplication. Because i own this multiset multiple times.

val div : t -> t -> (int * t)
Multiset division. How many times do i own this multiset ? Of course there is a (possibly empty) remainder.

backtracking · March 14, 2021, 4:52pm

Now I understand much better what you meant by inventory.
And your mul and div operations are nice suggestions indeed. Thanks!

gasche · March 14, 2021, 5:29pm

Nice! Some comments:

This is more than a fully-general bag, as you maintain an order structure. In a sense this feels like an implementation detail that leaks into the interface. Maybe later you may want to switch to a different implementation approach (for example using hashing, with ordered trees for the buckets) that does not make the ordering part of the interface efficient to provide? I have the vague feeling that it would be nicer to either call it ordered_bag, to make this aspect explicit, or to export both an ordered and a non-ordered interface.
I find it a bit disturbing that there is no complexity information on the functions. (But then those are arguably also “implementation details”.) The sideline comments on the fact that “internally bags are AVLs” lets me predict that most operations are logarithmic, but it would be nice to be explicit.
Why provide a hashing function in a documentation example instead of providing it as part of the interface? This would be a more robust way to discourage users from using the easy and wrong approach.
This type of datastructures is a perfect fit for QuickCheck-style testing, especially with the cool Crowbar library for whitebox fuzzing. This is fun, and also it encourages you to formulate first-order properties, which can also be useful documentation.
Bags with relative multiplicities make many things more convenient (everything substraction-related). I wonder if it would make sense to integrate them in the interface. (I would guess that for example one could add a positive : t -> t function dropping negative-multiplicity elements, and offer positive_diff and fold_positive as more efficient versions of positive (diff t1 t2) and fold f init (positive t).)

backtracking · March 14, 2021, 5:45pm

Here, order matters less than persistence. (Clément and I have use cases of bags where persistence is needed.) Order is here a mere consequence of the internal use of OCaml’s AVLs.

backtracking · March 14, 2021, 5:47pm

Good point. Yet, you’ll notice that there is not much about complexity in OCaml’s Map and Set either.

backtracking · March 14, 2021, 5:53pm

Ah ah! That’s exactly what I did in the first place. But providing a hash function in the functor interface requires a hash functions for elements in the functor’s argument. And that’s a bit intrusive, as several modules from the standard library, such as Bool/Int/Char/String, do not even provide a hash function. See my ticket https://github.com/ocaml/ocaml/issues/10259

Additionally, implementing a hash function for bags would have to be implemented using such a fold, so this is no big deal if this has to be done outside.

backtracking · March 14, 2021, 5:55pm

I thought about that for a while when designing the API but it sounded error-prone to have negative multiplicities. And this departs from the mathematical notion of multiset.

didier-wenzek · March 14, 2021, 8:30pm

Having relative multiplicities might depart a bit from the notion of multiset but gives a nice group structure. This can be used to implement undo/redo over a multiset using a stack of relative multisets. This can also be used in the inventory context cited above by @SpiceGuid to compute what is missing or redundant to build a given box.

backtracking · March 15, 2021, 8:20am

These are good arguments, indeed. Yet it does not seem obvious to conciliate the usual multiset semantics and relative multiplicities, e.g. should the bag difference (diff) simply make a subtraction or cap to zero? A non-intrusive solution would be a second functor Relative. We’ll have to think about it.

Thanks for the discussion.

gasche · March 15, 2021, 4:04pm

Regarding having an interface which handles both positive and relative weights, here was my proposal (but maybe having two different interfaces is nicer and cleaner indeed):

val diff : t -> t -> t 
(** [occ x (diff a b)] is [occ x a - occ x b] *)

val positive : t -> t
(** [occ x (positive a)] is [max 0 (occ x a)] *)

val positive_diff : t -> t -> t
(** [positive_diff a b] is equivalent to [positive (diff a b)] *)

val fold_positive : (elt -> int -> 'a -> 'a) -> t -> 'a -> 'a
(** [fold_positive f a init] is equivalent to [fold f (positive a) init] *)

With this interface I think that positive_diff is equivalent to the current diff, and fold_positive to the current fold. But there is a risk that users would call fold when they actually want fold_positive, which is avoided with the separate interface.

One tricky point: it is not obvious whether mem x b should be interpreted as occ x b > 0 or occ x b <> 0. I would expect mem x b to be equivalent to included (singleton x) b, and included makes sense with relative weights, so this suggests the occ x b > 0 interpretation. But then it means that fold, unlike fold_positive, may call the functions on elements that are not “members” of the multi-set – anti-elements?

rbardou · March 15, 2021, 4:56pm

(** [occ x (positive a)] is [min 0 (occ x a)] *)

Do you mean max 0 (occ x a)?

Topic		Replies	Views
[ANN] First release of baby Community announce	18	1474	April 4, 2025
New release of baby Community announce	5	398	December 10, 2024
Standardized mutable data structures Ecosystem	34	2479	October 29, 2022
Proposal for the replacement of Set and Map in the stdlib Community	6	563	April 3, 2025
Idiomatic group_by/distinct_by/count_by Learning first-class_modules	12	1506	October 25, 2023

[ANN] Release 1.0.0 of bag

Related topics