Does this "partition" function have a name?

I often seem to need the following computation, or a variation:

Given a list of things, and a function from things to ints (say), return a map from ints to lists of things.

That is: partition a list into sublists, where each sublist consists of those things having the same value of f thing.

Does this have a name and is it in a standard library?

The standard library Seq.partition does the equivalent for fs that return a boolean, but not the general case. If I understand correctly, Map.Make.of_seq makes maps from sequences but replaces entries with the same key.

Many thanks!

Not really answering your question but I thought it was well-fitted for @art-w’s doc.sherlocode.com tool which is a nice way to search the ecosystem for a specific type signature:

With the query : ('a -> int) -> 'a list -> _ I was able to find one occurrence of a function performing a similar task: lascar’s Utils.ListExt.scatter : ('a -> int) -> 'a list -> 'a list array

So at least one author called this ā€œscatterā€ and it looks like it is not part of the standard library :slight_smile:
If I was to propose a name, maybe n_partition ?

5 Likes

Don’t know if it’s in a standard library but usually I use the name classify for these kind of functions. Here’s an example which is slightly more general (a thing can belong to more than one class) than what you described.

Thanks both, that’s super helpful.

@dbuenzli : it will be very useful for me to read the code for your more general example as I am still getting to grips with the module system etc.

@vds: my corporate wifi is blocking sherlocode (who knows why) but I’m pretty confident I can get it unblocked and I was exactly trying to find such a lookup site.

I have these two functions in my standard utility file:

(* Returns elements for which p is true, until one is not, paired with the
remaining list. The same as [takewhile p l], [dropwhile p l], but requiring
only one pass over the list. *)
let cleavewhile p l =
  let rec cleavewhile_inner p l elts =
    match l with
    | [] -> List.rev elts, []
    | e::es ->
        if p e
          then cleavewhile_inner p es (e::elts)
          else List.rev elts, l
  in
    cleavewhile_inner p l []

(* Collate a list into a list of lists based upon a comparison function by which
it has already been sorted. e.g [collate compare [1; 2; 2; 3; 3]] calculates
[[[1]; [2;2]; [3;3]]]. *)
let collate cmp l =
  let rec collate_inner prev = function
    | [] -> List.rev prev
    | h::t ->
        let x, y = cleavewhile (fun a -> cmp h a = 0) (h::t) in
          collate_inner (x::prev) y
  in
    collate_inner [] l

So building (f x, x) pairs, collating them and then post-processing, we may write:

let collect f l =
  List.map
    (function [] -> assert false | (fx, x)::t -> (fx, x::List.map snd t))
    (collate (fun (a, b) (a', b') -> compare a a') (List.sort compare (List.map (fun x -> (f x, x)) l)))

And we get:

# collect (fun x -> x mod 3) [1; 2; 3; 4; 5; 6; 7; 8];;
- : (int * int list) list = [(0, [3; 6]); (1, [1; 4; 7]); (2, [2; 5; 8])]

The only conclusion I draw from this is that writing the function from scratch instead of building it from existing functions would be a lot neater…

Something like this should work (for general, not-necessarily ordered keys), but I don’t know if this is efficient:

let partition f list =
  let table = Hashtbl.create (List.length list) in
  List.iter (fun x ->
      let y = f x in
      let q = match Hashtbl.find_opt table y with
        | None -> let q = Queue.create () in Hashtbl.add table y q; q
        | Some q -> q in
      Queue.add x q) list;
  Hashtbl.fold (fun y q ->
      List.cons (y, Queue.fold (fun l x -> x::l) [] q)) table []
# partition (fun x -> x mod 3) [1; 2; 3; 4; 5; 6; 7; 8];;
- : (int * int list) list = [(1, [7; 4; 1]); (0, [6; 3]); (2, [8; 5; 2])]

I’d use a _ list ref instead of a queue, in each bucket of the
hashtbl, personally. Easier to add items to, and it turns into a list
trivially at the end.

In the context of dataframes, I believe this is called ā€˜groupby’. At least it’s called that in Pandas. That’s a bloated version of the same core concept. Also R has it.

Looks like a specialization of the more general union-find algorithm to me.

In case you can’t get it unblocked you can always use opam grep instead

2 Likes

As @n4323 said, this is just a GROUP BY in functional style. It has existed since SQL but in SQL
you don’t have nested tables, you have to apply an aggregate function to each sub-collection to return a list of pair (key, scalar value). Other languages are not limited by this data-model so e.g. Python has it in its itertools module and for instance Spark has it too (with a gazillion variations).

This doesn’t seem like the right signature for this thing in general, though. What if the image of f is sparse? I think @triangle-man original idea of the result as a map makes better sense. Or just a list of ('a list, int) tuples.

–
Ian

I guess you could try searching for simpler variations on the type:

It doesn’t seem so popular to obtain the group via an ('a -> key) function though, but the more common (key * 'a) list is a List.map away :slight_smile:

Just to say I really appreciate all the thoughts here. Thank you.

How funny, a couple of weeks later this blog:

[Lazily Grouping in Haskell - Donnacha OisĆ­n Kidney]

I’ll just drop this here. https://www.lri.fr/~filliatr/ftp/publis/puf-wml07.pdf

Does this exist on opam?

opam grep

What! TIL :smile: