I often seem to need the following computation, or a variation:
Given a list of things, and a function from things to ints (say), return a map from ints to lists of things.
That is: partition a list into sublists, where each sublist consists of those things having the same value of f thing.
Does this have a name and is it in a standard library?
The standard library Seq.partition does the equivalent for fs that return a boolean, but not the general case. If I understand correctly, Map.Make.of_seq makes maps from sequences but replaces entries with the same key.
Not really answering your question but I thought it was well-fitted for @art-wās doc.sherlocode.com tool which is a nice way to search the ecosystem for a specific type signature:
So at least one author called this āscatterā and it looks like it is not part of the standard library
If I was to propose a name, maybe n_partition ?
Donāt know if itās in a standard library but usually I use the name classify for these kind of functions. Hereās an example which is slightly more general (a thing can belong to more than one class) than what you described.
@dbuenzli : it will be very useful for me to read the code for your more general example as I am still getting to grips with the module system etc.
@vds: my corporate wifi is blocking sherlocode (who knows why) but Iām pretty confident I can get it unblocked and I was exactly trying to find such a lookup site.
I have these two functions in my standard utility file:
(* Returns elements for which p is true, until one is not, paired with the
remaining list. The same as [takewhile p l], [dropwhile p l], but requiring
only one pass over the list. *)
let cleavewhile p l =
let rec cleavewhile_inner p l elts =
match l with
| [] -> List.rev elts, []
| e::es ->
if p e
then cleavewhile_inner p es (e::elts)
else List.rev elts, l
in
cleavewhile_inner p l []
(* Collate a list into a list of lists based upon a comparison function by which
it has already been sorted. e.g [collate compare [1; 2; 2; 3; 3]] calculates
[[[1]; [2;2]; [3;3]]]. *)
let collate cmp l =
let rec collate_inner prev = function
| [] -> List.rev prev
| h::t ->
let x, y = cleavewhile (fun a -> cmp h a = 0) (h::t) in
collate_inner (x::prev) y
in
collate_inner [] l
So building (f x, x) pairs, collating them and then post-processing, we may write:
let collect f l =
List.map
(function [] -> assert false | (fx, x)::t -> (fx, x::List.map snd t))
(collate (fun (a, b) (a', b') -> compare a a') (List.sort compare (List.map (fun x -> (f x, x)) l)))
And we get:
# collect (fun x -> x mod 3) [1; 2; 3; 4; 5; 6; 7; 8];;
- : (int * int list) list = [(0, [3; 6]); (1, [1; 4; 7]); (2, [2; 5; 8])]
The only conclusion I draw from this is that writing the function from scratch instead of building it from existing functions would be a lot neaterā¦
Something like this should work (for general, not-necessarily ordered keys), but I donāt know if this is efficient:
let partition f list =
let table = Hashtbl.create (List.length list) in
List.iter (fun x ->
let y = f x in
let q = match Hashtbl.find_opt table y with
| None -> let q = Queue.create () in Hashtbl.add table y q; q
| Some q -> q in
Queue.add x q) list;
Hashtbl.fold (fun y q ->
List.cons (y, Queue.fold (fun l x -> x::l) [] q)) table []
# partition (fun x -> x mod 3) [1; 2; 3; 4; 5; 6; 7; 8];;
- : (int * int list) list = [(1, [7; 4; 1]); (0, [6; 3]); (2, [8; 5; 2])]
Iād use a _ list ref instead of a queue, in each bucket of the
hashtbl, personally. Easier to add items to, and it turns into a list
trivially at the end.
In the context of dataframes, I believe this is called āgroupbyā. At least itās called that in Pandas. Thatās a bloated version of the same core concept. Also R has it.
As @n4323 said, this is just a GROUP BY in functional style. It has existed since SQL but in SQL
you donāt have nested tables, you have to apply an aggregate function to each sub-collection to return a list of pair (key, scalar value). Other languages are not limited by this data-model so e.g. Python has it in its itertools module and for instance Spark has it too (with a gazillion variations).
This doesnāt seem like the right signature for this thing in general, though. What if the image of f is sparse? I think @triangle-man original idea of the result as a map makes better sense. Or just a list of ('a list, int) tuples.
: list -> Hashtbl.t for a lot of solutions, because Hashtbl already accumulates same-key values in a list (extractable with find_all: Hashtbl.t -> list)