Clément and myself are pleased to announce the first release of bag
, a library of multisets.
Here is the documentation
Clément and myself are pleased to announce the first release of bag
, a library of multisets.
Here is the documentation
Is a multiset an inventory ? If no then what’s the difference ? If yes then i can imagine some inventory operation(s) that bag 1.0.0 does not offer.
A multiset is nothing more than a set where elements can appear multiple times. See https://en.wikipedia.org/wiki/Multiset
I don’t think this is the same concept as an inventory (though I have to admit I’m not 100% sure I understand what an inventory is).
Let’s suppose i have building instructions for a big lego model. How many bricks do i miss ? What official lego sets should i buy to recover most of them ? What order to BrickLink.com should i place to complete the picture ? How much bag 1.0.0 helps me to solve this multiset problem ?
Here are two additional multiset operations i suggest (from my empirical & programming experience in lego-building) :
val mul : t -> int -> t
Multiset multiplication. Because i own this multiset multiple times.
val div : t -> t -> (int * t)
Multiset division. How many times do i own this multiset ? Of course there is a (possibly empty) remainder.
Now I understand much better what you meant by inventory.
And your mul
and div
operations are nice suggestions indeed. Thanks!
Nice! Some comments:
This is more than a fully-general bag, as you maintain an order structure. In a sense this feels like an implementation detail that leaks into the interface. Maybe later you may want to switch to a different implementation approach (for example using hashing, with ordered trees for the buckets) that does not make the ordering part of the interface efficient to provide? I have the vague feeling that it would be nicer to either call it ordered_bag
, to make this aspect explicit, or to export both an ordered and a non-ordered interface.
I find it a bit disturbing that there is no complexity information on the functions. (But then those are arguably also “implementation details”.) The sideline comments on the fact that “internally bags are AVLs” lets me predict that most operations are logarithmic, but it would be nice to be explicit.
Why provide a hashing function in a documentation example instead of providing it as part of the interface? This would be a more robust way to discourage users from using the easy and wrong approach.
This type of datastructures is a perfect fit for QuickCheck-style testing, especially with the cool Crowbar library for whitebox fuzzing. This is fun, and also it encourages you to formulate first-order properties, which can also be useful documentation.
Bags with relative multiplicities make many things more convenient (everything substraction-related). I wonder if it would make sense to integrate them in the interface. (I would guess that for example one could add a positive : t -> t
function dropping negative-multiplicity elements, and offer positive_diff
and fold_positive
as more efficient versions of positive (diff t1 t2)
and fold f init (positive t)
.)
Here, order matters less than persistence. (Clément and I have use cases of bags where persistence is needed.) Order is here a mere consequence of the internal use of OCaml’s AVLs.
Good point. Yet, you’ll notice that there is not much about complexity in OCaml’s Map
and Set
either.
Ah ah! That’s exactly what I did in the first place. But providing a hash
function in the functor interface requires a hash
functions for elements in the functor’s argument. And that’s a bit intrusive, as several modules from the standard library, such as Bool
/Int
/Char
/String
, do not even provide a hash
function. See my ticket https://github.com/ocaml/ocaml/issues/10259
Additionally, implementing a hash
function for bags would have to be implemented using such a fold
, so this is no big deal if this has to be done outside.
I thought about that for a while when designing the API but it sounded error-prone to have negative multiplicities. And this departs from the mathematical notion of multiset.
Having relative multiplicities might depart a bit from the notion of multiset but gives a nice group structure. This can be used to implement undo/redo over a multiset using a stack of relative multisets. This can also be used in the inventory context cited above by @SpiceGuid to compute what is missing or redundant to build a given box.
These are good arguments, indeed. Yet it does not seem obvious to conciliate the usual multiset semantics and relative multiplicities, e.g. should the bag difference (diff
) simply make a subtraction or cap to zero? A non-intrusive solution would be a second functor Relative
. We’ll have to think about it.
Thanks for the discussion.
Regarding having an interface which handles both positive and relative weights, here was my proposal (but maybe having two different interfaces is nicer and cleaner indeed):
val diff : t -> t -> t
(** [occ x (diff a b)] is [occ x a - occ x b] *)
val positive : t -> t
(** [occ x (positive a)] is [max 0 (occ x a)] *)
val positive_diff : t -> t -> t
(** [positive_diff a b] is equivalent to [positive (diff a b)] *)
val fold_positive : (elt -> int -> 'a -> 'a) -> t -> 'a -> 'a
(** [fold_positive f a init] is equivalent to [fold f (positive a) init] *)
With this interface I think that positive_diff
is equivalent to the current diff
, and fold_positive
to the current fold
. But there is a risk that users would call fold
when they actually want fold_positive
, which is avoided with the separate interface.
One tricky point: it is not obvious whether mem x b
should be interpreted as occ x b > 0
or occ x b <> 0
. I would expect mem x b
to be equivalent to included (singleton x) b
, and included
makes sense with relative weights, so this suggests the occ x b > 0
interpretation. But then it means that fold
, unlike fold_positive
, may call the functions on elements that are not “members” of the multi-set – anti-elements?
(** [occ x (positive a)] is [min 0 (occ x a)] *)
Do you mean max 0 (occ x a)
?