Does this "partition" function have a name?

I often seem to need the following computation, or a variation:

Given a list of things, and a function from things to ints (say), return a map from ints to lists of things.

That is: partition a list into sublists, where each sublist consists of those things having the same value of f thing.

Does this have a name and is it in a standard library?

The standard library Seq.partition does the equivalent for fs that return a boolean, but not the general case. If I understand correctly, Map.Make.of_seq makes maps from sequences but replaces entries with the same key.

Many thanks!

Not really answering your question but I thought it was well-fitted for @art-wā€™s doc.sherlocode.com tool which is a nice way to search the ecosystem for a specific type signature:

With the query : ('a -> int) -> 'a list -> _ I was able to find one occurrence of a function performing a similar task: lascarā€™s Utils.ListExt.scatter : ('a -> int) -> 'a list -> 'a list array

So at least one author called this ā€œscatterā€ and it looks like it is not part of the standard library :slight_smile:
If I was to propose a name, maybe n_partition ?

5 Likes

Donā€™t know if itā€™s in a standard library but usually I use the name classify for these kind of functions. Hereā€™s an example which is slightly more general (a thing can belong to more than one class) than what you described.

Thanks both, thatā€™s super helpful.

@dbuenzli : it will be very useful for me to read the code for your more general example as I am still getting to grips with the module system etc.

@vds: my corporate wifi is blocking sherlocode (who knows why) but Iā€™m pretty confident I can get it unblocked and I was exactly trying to find such a lookup site.

I have these two functions in my standard utility file:

(* Returns elements for which p is true, until one is not, paired with the
remaining list. The same as [takewhile p l], [dropwhile p l], but requiring
only one pass over the list. *)
let cleavewhile p l =
  let rec cleavewhile_inner p l elts =
    match l with
    | [] -> List.rev elts, []
    | e::es ->
        if p e
          then cleavewhile_inner p es (e::elts)
          else List.rev elts, l
  in
    cleavewhile_inner p l []

(* Collate a list into a list of lists based upon a comparison function by which
it has already been sorted. e.g [collate compare [1; 2; 2; 3; 3]] calculates
[[[1]; [2;2]; [3;3]]]. *)
let collate cmp l =
  let rec collate_inner prev = function
    | [] -> List.rev prev
    | h::t ->
        let x, y = cleavewhile (fun a -> cmp h a = 0) (h::t) in
          collate_inner (x::prev) y
  in
    collate_inner [] l

So building (f x, x) pairs, collating them and then post-processing, we may write:

let collect f l =
  List.map
    (function [] -> assert false | (fx, x)::t -> (fx, x::List.map snd t))
    (collate (fun (a, b) (a', b') -> compare a a') (List.sort compare (List.map (fun x -> (f x, x)) l)))

And we get:

# collect (fun x -> x mod 3) [1; 2; 3; 4; 5; 6; 7; 8];;
- : (int * int list) list = [(0, [3; 6]); (1, [1; 4; 7]); (2, [2; 5; 8])]

The only conclusion I draw from this is that writing the function from scratch instead of building it from existing functions would be a lot neaterā€¦

Something like this should work (for general, not-necessarily ordered keys), but I donā€™t know if this is efficient:

let partition f list =
  let table = Hashtbl.create (List.length list) in
  List.iter (fun x ->
      let y = f x in
      let q = match Hashtbl.find_opt table y with
        | None -> let q = Queue.create () in Hashtbl.add table y q; q
        | Some q -> q in
      Queue.add x q) list;
  Hashtbl.fold (fun y q ->
      List.cons (y, Queue.fold (fun l x -> x::l) [] q)) table []
# partition (fun x -> x mod 3) [1; 2; 3; 4; 5; 6; 7; 8];;
- : (int * int list) list = [(1, [7; 4; 1]); (0, [6; 3]); (2, [8; 5; 2])]

Iā€™d use a _ list ref instead of a queue, in each bucket of the
hashtbl, personally. Easier to add items to, and it turns into a list
trivially at the end.

In the context of dataframes, I believe this is called ā€˜groupbyā€™. At least itā€™s called that in Pandas. Thatā€™s a bloated version of the same core concept. Also R has it.

Looks like a specialization of the more general union-find algorithm to me.

In case you canā€™t get it unblocked you can always use opam grep instead

2 Likes

As @n4323 said, this is just a GROUP BY in functional style. It has existed since SQL but in SQL
you donā€™t have nested tables, you have to apply an aggregate function to each sub-collection to return a list of pair (key, scalar value). Other languages are not limited by this data-model so e.g. Python has it in its itertools module and for instance Spark has it too (with a gazillion variations).

This doesnā€™t seem like the right signature for this thing in general, though. What if the image of f is sparse? I think @triangle-man original idea of the result as a map makes better sense. Or just a list of ('a list, int) tuples.

ā€“
Ian

I guess you could try searching for simpler variations on the type:

It doesnā€™t seem so popular to obtain the group via an ('a -> key) function though, but the more common (key * 'a) list is a List.map away :slight_smile:

Just to say I really appreciate all the thoughts here. Thank you.

How funny, a couple of weeks later this blog:

[Lazily Grouping in Haskell - Donnacha OisĆ­n Kidney]

Iā€™ll just drop this here. https://www.lri.fr/~filliatr/ftp/publis/puf-wml07.pdf

Does this exist on opam?

opam grep

What! TIL :smile: