Does this "partition" function have a name?

triangle-man · October 5, 2022, 8:53am

I often seem to need the following computation, or a variation:

Given a list of things, and a function from things to ints (say), return a map from ints to lists of things.

That is: partition a list into sublists, where each sublist consists of those things having the same value of f thing.

Does this have a name and is it in a standard library?

The standard library Seq.partition does the equivalent for fs that return a boolean, but not the general case. If I understand correctly, Map.Make.of_seq makes maps from sequences but replaces entries with the same key.

Many thanks!

vds · October 5, 2022, 9:30am

Not really answering your question but I thought it was well-fitted for @art-w’s doc.sherlocode.com tool which is a nice way to search the ecosystem for a specific type signature:

With the query : ('a -> int) -> 'a list -> _ I was able to find one occurrence of a function performing a similar task: lascar’s Utils.ListExt.scatter : ('a -> int) -> 'a list -> 'a list array

So at least one author called this “scatter” and it looks like it is not part of the standard library
If I was to propose a name, maybe n_partition ?

dbuenzli · October 5, 2022, 9:51am

Don’t know if it’s in a standard library but usually I use the name classify for these kind of functions. Here’s an example which is slightly more general (a thing can belong to more than one class) than what you described.

triangle-man · October 5, 2022, 10:07am

Thanks both, that’s super helpful.

@dbuenzli : it will be very useful for me to read the code for your more general example as I am still getting to grips with the module system etc.

@vds: my corporate wifi is blocking sherlocode (who knows why) but I’m pretty confident I can get it unblocked and I was exactly trying to find such a lookup site.

JohnWhitington · October 5, 2022, 2:03pm

I have these two functions in my standard utility file:

(* Returns elements for which p is true, until one is not, paired with the
remaining list. The same as [takewhile p l], [dropwhile p l], but requiring
only one pass over the list. *)
let cleavewhile p l =
  let rec cleavewhile_inner p l elts =
    match l with
    | [] -> List.rev elts, []
    | e::es ->
        if p e
          then cleavewhile_inner p es (e::elts)
          else List.rev elts, l
  in
    cleavewhile_inner p l []

(* Collate a list into a list of lists based upon a comparison function by which
it has already been sorted. e.g [collate compare [1; 2; 2; 3; 3]] calculates
[[[1]; [2;2]; [3;3]]]. *)
let collate cmp l =
  let rec collate_inner prev = function
    | [] -> List.rev prev
    | h::t ->
        let x, y = cleavewhile (fun a -> cmp h a = 0) (h::t) in
          collate_inner (x::prev) y
  in
    collate_inner [] l

So building (f x, x) pairs, collating them and then post-processing, we may write:

let collect f l =
  List.map
    (function [] -> assert false | (fx, x)::t -> (fx, x::List.map snd t))
    (collate (fun (a, b) (a', b') -> compare a a') (List.sort compare (List.map (fun x -> (f x, x)) l)))

And we get:

# collect (fun x -> x mod 3) [1; 2; 3; 4; 5; 6; 7; 8];;
- : (int * int list) list = [(0, [3; 6]); (1, [1; 4; 7]); (2, [2; 5; 8])]

The only conclusion I draw from this is that writing the function from scratch instead of building it from existing functions would be a lot neater…

sanette · October 5, 2022, 3:14pm

Something like this should work (for general, not-necessarily ordered keys), but I don’t know if this is efficient:

let partition f list =
  let table = Hashtbl.create (List.length list) in
  List.iter (fun x ->
      let y = f x in
      let q = match Hashtbl.find_opt table y with
        | None -> let q = Queue.create () in Hashtbl.add table y q; q
        | Some q -> q in
      Queue.add x q) list;
  Hashtbl.fold (fun y q ->
      List.cons (y, Queue.fold (fun l x -> x::l) [] q)) table []

# partition (fun x -> x mod 3) [1; 2; 3; 4; 5; 6; 7; 8];;
- : (int * int list) list = [(1, [7; 4; 1]); (0, [6; 3]); (2, [8; 5; 2])]

c-cube · October 5, 2022, 3:26pm

I’d use a _ list ref instead of a queue, in each bucket of the
hashtbl, personally. Easier to add items to, and it turns into a list
trivially at the end.

n4323 · October 5, 2022, 4:05pm

In the context of dataframes, I believe this is called ‘groupby’. At least it’s called that in Pandas. That’s a bloated version of the same core concept. Also R has it.

jhw · October 5, 2022, 5:31pm

Looks like a specialization of the more general union-find algorithm to me.

kit-ty-kate · October 5, 2022, 6:29pm

In case you can’t get it unblocked you can always use opam grep instead

K_N · October 5, 2022, 6:39pm

As @n4323 said, this is just a GROUP BY in functional style. It has existed since SQL but in SQL
you don’t have nested tables, you have to apply an aggregate function to each sub-collection to return a list of pair (key, scalar value). Other languages are not limited by this data-model so e.g. Python has it in its itertools module and for instance Spark has it too (with a gazillion variations).

nobrowser · October 5, 2022, 10:46pm

This doesn’t seem like the right signature for this thing in general, though. What if the image of f is sparse? I think @triangle-man original idea of the result as a map makes better sense. Or just a list of ('a list, int) tuples.

–
Ian

art-w · October 5, 2022, 11:38pm

I guess you could try searching for simpler variations on the type:

: list -> list Map.Make.t reveals Core.Map.Make.of_alist_multi : (Key.t * 'a) list -> 'a list t
: list -> Hashtbl.t for a lot of solutions, because Hashtbl already accumulates same-key values in a list (extractable with find_all : Hashtbl.t -> list)
: ('a -> int) -> list -> list list for some group-by-equality stuff, or : ('a -> 'b) -> list -> (list * 'b) list for a more general group_by (not the first results though! )

It doesn’t seem so popular to obtain the group via an ('a -> key) function though, but the more common (key * 'a) list is a List.map away

triangle-man · October 6, 2022, 8:23am

Just to say I really appreciate all the thoughts here. Thank you.

nobrowser · October 18, 2022, 3:00am

How funny, a couple of weeks later this blog:

[Lazily Grouping in Haskell - Donnacha Oisín Kidney]

jhw · October 18, 2022, 5:06am

I’ll just drop this here. https://www.lri.fr/~filliatr/ftp/publis/puf-wml07.pdf

n4323 · October 20, 2022, 8:14am

Does this exist on opam?

hyphenrf · October 20, 2022, 9:58pm

opam grep

What! TIL

Topic		Replies	Views
A short GADT puzzle Learning gadt	12	1399	November 24, 2022
Thoughts on adding List.map_result, List.fold_left_result and similar functions to the Stdlib Learning	5	370	August 21, 2024
[ANN] release of iter 1.2 Community announce , iterator	8	1573	March 14, 2019
Adding indexed variants of all functions in stdlib Community string , stdlib , array , list , seq	6	669	August 11, 2023
Laziest encoding of map_product for Seq.t Learning	4	447	October 24, 2022

Does this "partition" function have a name?

Related topics