Idiomatic group_by/distinct_by/count_by

Jon_Harrop · February 23, 2022, 4:42pm

I love using pipelines of combinators to munge data but OCaml’s stdlib is missing some obvious ones such as group_by/distinct_by/count_by. You can write them easily enough in vanilla OCaml using Hashtbl, e.g. Array.distinct_by:

let distinct_by f xs =
  let seen = Hashtbl.create 16 in
  let result = ExtArray.empty() in
  xs
  |> Array.iter (fun x ->
    let fx = f x in
    if Hashtbl.mem seen fx then () else
      ( Hashtbl.add seen fx ();
        ExtArray.append result x ));
  ExtArray.to_array result

but that relies upon polymorphic equality and hashing. If I want to write these kinds of functions using functors I guess I’d use first-class modules to pass a concrete Hashtbl implementation (generated by Hashtbl.Make) into that function. What exactly would that look like?

Furthermore, can I factor out any commonality between equivalent functions for different ordered collections such as List, Array, Stack, Queue and so on using functors?

jfeser · February 23, 2022, 5:43pm

Here’s an example from the iter library iter/Iter.ml at master · c-cube/iter · GitHub. I’m a fan of using iterators to abstract over different collection types. The stdlib has Seq now, which can serve the same purpose.

Maelan · February 23, 2022, 5:45pm

Funnily enough, I played with the problem of implementing classifiers recently. Here is what I came with. One of the attempts use first-class modules indeed. Not saying that’s the best possible interface; I let you judge. That might at least answer your question:

What exactly would that look like?

benjamin-thomas · October 15, 2023, 11:07am

I was very surprised to find out group_by is missing. It’s the kind of function I make heavy use of in a typical program.

Also, it doesn’t look like the alternative standard libraries provide it either.

This is what I came up with:

let group_by f lst =
  let rec aux acc f = function
    | [] -> acc
    | h :: t ->
        let new_acc =
          let n = f h in
          match List.assoc_opt n acc with
          | None -> (n, [ h ]) :: acc
          | Some t2 -> (n, h :: t2) :: List.remove_assoc n acc
        in
        aux new_acc f t
  in
  aux [] f lst
;;

let%test _ =
  group_by List.length
    [ [ 'a'; 'b'; 'c' ]
    ; [ 'd'; 'e' ]
    ; [ 'f'; 'g'; 'h' ]
    ; [ 'd'; 'e' ]
    ; [ 'i'; 'j'; 'k'; 'l' ]
    ; [ 'm'; 'n' ]
    ; [ 'o' ]
    ]
  = [ (1, [ [ 'o' ] ])
    ; (2, [ [ 'm'; 'n' ]; [ 'd'; 'e' ]; [ 'd'; 'e' ] ])
    ; (4, [ [ 'i'; 'j'; 'k'; 'l' ] ])
    ; (3, [ [ 'f'; 'g'; 'h' ]; [ 'a'; 'b'; 'c' ] ])
    ]
;;

let%test _ =
  group_by List.length [ [ 1; 2; 3 ]; [ 4; 5 ]; []; [ 7; 8 ]; [ 9; 10; 11 ] ]
  = [ (3, [ [ 9; 10; 11 ]; [ 1; 2; 3 ] ])
    ; (2, [ [ 7; 8 ]; [ 4; 5 ] ])
    ; (0, [ [] ])
    ]
;;

let%test _ =
  let lt x n = n < x in
  group_by (lt 10) [ 2; 4; 8; 16; 32; 64 ]
  = [ (false, [ 64; 32; 16 ]); (true, [ 8; 4; 2 ]) ]
;;

I’d be interested to know if there are downsides to this approach, I haven’t learned about the Hashtbl module yet.

Note that this API is similar to F#'s:

> List.groupBy List.length [ [ 1; 2; 3 ]; [ 4; 5 ]; []; [ 7; 8 ]; [ 9; 10; 11 ] ];;
val it: (int * int list list) list =
  [(3, [[1; 2; 3]; [9; 10; 11]]); (2, [[4; 5]; [7; 8]]); (0, [[]])]

> let lt x n = n < x;;   
val lt: x: 'a -> n: 'a -> bool when 'a: comparison

> List.groupBy (lt 10) [ 2; 4; 8; 16; 32; 64 ];;
val it: (bool * int list) list = [(true, [2; 4; 8]); (false, [16; 32; 64])]

K_N · October 15, 2023, 12:58pm

Two things:

the part of your group_by that is using List.assoc makes the whole thing quadratic in the size of the input (e.g., when every value is in its own group). You can alleviate this by using a Hashtbl.t.
If you want to use only list, you could first sort by key value and then linearly scan and group the contiguous keys, which would make the whole thing NlogN.
in all these cases you implicitly rely on polymorphic equality (or comparison if you sort), for instance in List.assoc. So, if you don’t want that, you have to pass also functions to compare and hash keys together with the function that extract the key from the value.

benjamin-thomas · October 19, 2023, 6:12pm

What you said made sense to me.

I thought I should try to measure the probable improvements, but to my surprise my “naive” implementation seems to be the fastest on small to medium size lists.

I haven’t figured out yet how to properly implement group_by with a Hashtbl but it looks like just putting data into it is already more expensive.

Here’s my dune file:

(executable
 (name bench)
 (libraries core_unix.command_unix core core_bench))

And my test file:

module Bench = Core_bench.Bench

(* dune exec ./bench.exe *)

let range lo hi =
  let rec aux acc curr hi =
    if curr > hi then
      acc
    else
      aux (curr :: acc) (curr + 1) hi
  in
  if lo < hi then
    aux [] lo hi |> List.rev
  else
    aux [] hi lo
;;

let lst =
  [ range 1 3
  ; range 4 5
  ; range 1 10
  ; range 2 5
  ; range 40 2999
  ; range 3 99
  ; range 200 300
  ; range 9 50
  ]
;;

let t1 =
  let group_by f lst =
    let rec aux acc f = function
      | [] -> acc
      | h :: t ->
          let new_acc =
            let n = f h in
            match List.assoc_opt n acc with
            | None -> (n, [ h ]) :: acc
            | Some t2 -> (n, h :: t2) :: List.remove_assoc n acc
          in
          aux new_acc f t
    in
    aux [] f lst
  in
  Bench.Test.create ~name:"group_by (naive impl)" (fun () ->
      group_by List.length lst)
;;

let t2 =
  let group_by f lst =
    let tbl = Hashtbl.create 1024 in
    let rec aux f = function
      | [] -> ()
      | h :: t ->
          let n = f h in
          ()
          ; Hashtbl.add tbl n h
          ; aux f t
    in
    ()
    ; aux f lst
    ; tbl
      |> Hashtbl.to_seq_keys
      |> List.of_seq
      |> List.sort_uniq compare
      |> List.map (fun key -> (key, Hashtbl.find_all tbl key))
  in
  Bench.Test.create ~name:"group_by (Hashtbl, ugly impl)" (fun () ->
      group_by List.length lst)
;;

let t3 =
  let group_contiguous lst =
    let rec aux prev acc acc2 = function
      | [] -> (
          match acc with
          | [] -> acc2
          | _ -> (prev, acc) :: acc2)
      | (n, xs) :: t ->
          if n = prev then
            aux n (xs :: acc) acc2 t
          else
            aux n [ xs ] ((prev, acc) :: acc2) t
    in
    match lst with
    | [] -> []
    | (n, xs) :: t -> aux n [ xs ] [] t
  in

  let group_by f lst =
    lst
    |> List.map (fun sub_lst -> (f sub_lst, sub_lst))
    |> List.sort (Fun.flip compare)
    |> group_contiguous
  in
  Bench.Test.create ~name:"group_by (no Hashtbl, group_contiguous)" (fun () ->
      group_by List.length lst)
;;

let t4 =
  let fill_hashtbl f lst =
    let tbl = Hashtbl.create 1024 in
    let rec aux f = function
      | [] -> ()
      | h :: t ->
          let n = f h in
          ()
          ; Hashtbl.add tbl n h
          ; aux f t
    in
    ()
    ; aux f lst
    ; tbl
  in
  Bench.Test.create ~name:"group_by (just fill the Hashtbl)" (fun () ->
      fill_hashtbl List.length lst)
;;

let tests = [ t1; t2; t3; t4 ]
let command = Core_bench.Bench.make_command tests
let () = Command_unix.run command

This is what I get:

┌─────────────────────────────────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name                                    │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├─────────────────────────────────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ group_by (naive impl)                   │   3.46us │  71.97w │          │          │     50.84% │
│ group_by (Hashtbl, ugly impl)           │   6.80us │ 473.14w │   1.06kw │   32.06w │    100.00% │
│ group_by (no Hashtbl, group_contiguous) │   3.49us │ 240.04w │          │          │     51.28% │
│ group_by (just fill the Hashtbl)        │   5.26us │  42.01w │   1.06kw │   32.01w │     77.38% │
└─────────────────────────────────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘

Am I misunderstand something?

dimitris · October 19, 2023, 7:23pm

Well, the definition of big O says that: f(x) is O(g(x) if there exist c and x_0 such that for all x > x_0, we have f(x) <= c * g(x).

This means, you should expect to see the effect for sufficiently large input…

(From a quick look at your code, there’s no medium size input, only small of up to a few thousand elements.)

Update: your code has some mistake because with 500,000-800,000 elements, the tests take approximately the same time.

K_N · October 19, 2023, 7:24pm

So here are a few things. First, if I understand correctly, your collection that you are trying to group by
is made of 8 lists that you are trying to group by lenghts. It also happens that they are all of distinct lengths. I’m not sure this is quite a good enough benchmark. Indeed in such a benchmark even List.assoc may be competitive (if you only traverse a list of length at most 8).

That being said there are a few places were your use of hash tables or sorting is not optimal:

 |> Hashtbl.to_seq_keys
      |> List.of_seq
      |> List.sort_uniq compare
      |> List.map (fun key -> (key, Hashtbl.find_all tbl key))

This is what dominates your time for hash tables. You don’t need all this. Also, internally, buckets of hashtables are not lists (meaning not the type list) but mutable linked lists. So when you do find_all you actually have to reconstruct a lists from the mutable bucket. The more straightforward version below is much faster:

let t2 =
  let group_by f lst =
    let tbl = Hashtbl.create 16 in
    let rec aux f = function
      | [] -> ()
      | h :: t ->
        let n = f h in
        let l = try Hashtbl.find tbl n with Not_found -> [] in
        Hashtbl.replace tbl n (h::l);
        aux f t
    in
    aux f lst;
    Hashtbl.fold (fun k _ acc -> (k, Hashtbl.find tbl k)::acc) tbl [] 
  in
  Bench.Test.create ~name:"group_by (Hashtbl, ugly impl)" (fun () ->
      group_by List.length lst)
;;

Also there is no need to allocated a table of size 1024 if only 8 values will be used (this should not impact much).
For the sorting part, you use:

   |> List.map (fun sub_lst -> (f sub_lst, sub_lst))
    |> List.sort (Fun.flip compare)
    |> group_contiguous

Which means that for lists of equal lengths (you don’t have any) you will call compare on the second component of the list. You should just do:

lst
   |> List.map (fun sub_lst -> (f sub_lst, sub_lst))
   |> List.sort (fun (x, _) (y, _) -> compare x y)
   |> group_contiguous

with this modifications, All versions takes about the same time (depending on the run).

K_N · October 19, 2023, 7:47pm

So I did a few more experiments. First, I changed the input data to be:

let lst = Arg.read_arg "american-english" |> Array.to_list

where american-english is the default list of English words used for spell checking on my system (/usr/share/dict/american-english). 104000ish words each on one line. If I run your tests but with String.length as the group_by key then I get:

┌─────────────────────────────────────────┬──────────┬────────────┬────────────┬────────────┬────────────┐
│ Name                                    │ Time/Run │    mWd/Run │   mjWd/Run │   Prom/Run │ Percentage │
├─────────────────────────────────────────┼──────────┼────────────┼────────────┼────────────┼────────────┤
│ group_by (naive impl)                   │  13.85ms │ 2_451.49kw │   297.79kw │   297.79kw │     31.40% │
│ group_by (Hashtbl, ugly impl)           │   8.51ms │   313.27kw │   313.43kw │   313.43kw │     19.30% │
│ group_by (no Hashtbl, group_contiguous) │  44.10ms │ 6_252.34kw │ 2_077.46kw │ 2_077.46kw │    100.00% │
│ group_by (just fill the Hashtbl)        │   7.59ms │   418.47kw │   678.64kw │   417.50kw │     17.21% │
└─────────────────────────────────────────┴──────────┴────────────┴────────────┴────────────┴────────────┘

Here the Hashtbl implementation wins, but “naive” is not so bad. Indeed, there are very few distinct word lengths (from 1 to 30 or so chars). So again you do insertions in a small list and that’s fine. Here the sorting is not so good since you pay the cost up front of mapping (length, word) for each of the 104000 words and then sort that.
But now, with the same data you group by “Hashtbl.hash” that is, you try to find all words in the dictionary for which OCaml’s default hash function has a collision. Lo and behold:

┌─────────────────────────────────────────┬─────────────┬────────────┬────────────┬────────────┬────────────┐
│ Name                                    │    Time/Run │    mWd/Run │   mjWd/Run │   Prom/Run │ Percentage │
├─────────────────────────────────────────┼─────────────┼────────────┼────────────┼────────────┼────────────┤
│ group_by (naive impl)                   │ 48_209.15ms │ 1_355.97kw │ 1_032.19kw │ 1_032.19kw │    100.00% │
│ group_by (Hashtbl, ugly impl)           │     38.17ms │ 1_357.45kw │ 1_515.15kw │ 1_254.02kw │      0.08% │
│ group_by (no Hashtbl, group_contiguous) │     66.87ms │ 6_878.18kw │ 2_741.10kw │ 2_741.10kw │      0.14% │
│ group_by (just fill the Hashtbl)        │     17.67ms │   418.48kw │   679.27kw │   418.13kw │      0.04% │
└─────────────────────────────────────────┴─────────────┴────────────┴────────────┴────────────┴────────────┘

Here the distribution of the keys is much much larger (and you only have a few collisions so your groups are never more than two or so elements).
Edit: typos.

benjamin-thomas · October 19, 2023, 8:55pm

Thanks, I did not know about this definition. I’m not sure how to apply it in concrete terms but yes, I do see a difference if I bump up the list size now.

Sorry for not being clearer. So the list I wanted to process was in fact smaller, it was the one I originally posted. But I did notice that I was doing sub-optimal list operations. So when you pointed at possible deficiencies, I expected to seeing them reflected even in small inputs.

In many languages, group_by is provided and I don’t have to think too much about it. So if I were to design my own function, I would like to understand how to evaluate what would be a good “overall” solution. So that’s why I was asking more feedback.

Thanks! I appreciate the additional explanation. It’s cool that you found a way to demonstrate the problem with the naive version. I like your idea of grouping contiguous items btw, and I’ll take a good look at your Hashtbl implementation

UnixJunkie · October 22, 2023, 4:19am

The famous Jon Harrop doesn’t use an extended standard library for OCaml?
He is still just using the stdlib?
Poor Jon Harrop.

PS: in batteries, look at BatList.group for example

benjamin-thomas · October 22, 2023, 7:11am

So I did look at the “containers” library first, which I understood more or less replaces “batteries”.

I found container’s implementation a little too complicated for my current OCaml knowledge.

Both batteries and containers don’t return the key on which we group things, so these implementations aren’t very useful to me.

Here’s a recap from a few languages providing a built-in group_by function:

Ruby:

irb(main):001:0> lst = [%w[a b c], %w[d e], %w[f g h], %w[d e], %w[i j k l], %w[m n], %w[o]]
irb(main):002:0> lst.group_by(&:length)
=> {3=>[["a", "b", "c"], ["f", "g", "h"]], 2=>[["d", "e"], ["d", "e"], ["m", "n"]], 4=>[["i", "j", "k", "l"]], 1=>[["o"]]}

---

Clojure:

user=> (def lst '((a b c) (d e) (f g h) (d e) (i j k l) (m n) (o)))
{3 [(a b c) (f g h)], 2 [(d e) (d e) (m n)], 4 [(i j k l)], 1 [(o)]}
user=> (group-by count lst)
{3 [(a b c) (f g h)], 2 [(d e) (d e) (m n)], 4 [(i j k l)], 1 [(o)]}


---

Elixir:

iex(1)> lst = [['a', 'b', 'c'], ['d', 'e'], ['f', 'g', 'h'], ['d', 'e'], ['i', 'j', 'k', 'l'], ['m', 'n'], ['o']]
[
  ['a', 'b', 'c'],
  ['d', 'e'],
  ['f', 'g', 'h'],
  ['d', 'e'],
  ['i', 'j', 'k', 'l'],
  ['m', 'n'],
  ['o']
]
iex(2)> lst |> Enum.group_by(&length(&1))
%{
  1 => [['o']],
  2 => [['d', 'e'], ['d', 'e'], ['m', 'n']],
  3 => [['a', 'b', 'c'], ['f', 'g', 'h']],
  4 => [['i', 'j', 'k', 'l']]
}


---

F#:

> let lst = [['a';'b';'c'];['d';'e'];['f';'g';'h'];['d';'e'];['i';'j';'k';'l'];['m';'n'];['o']];;
val lst: char list list =
  [['a'; 'b'; 'c']; ['d'; 'e']; ['f'; 'g'; 'h']; ['d'; 'e'];
   ['i'; 'j'; 'k'; 'l']; ['m'; 'n']; ['o']]

> List.groupBy List.length lst;;
val it: (int * char list list) list =
  [(3, [['a'; 'b'; 'c']; ['f'; 'g'; 'h']]);
   (2, [['d'; 'e']; ['d'; 'e']; ['m'; 'n']]); (4, [['i'; 'j'; 'k'; 'l']]);
   (1, [['o']])]
   
---

Java:

jshell> var lst = List.of(List.of('a','b','c'), List.of('d','e'), List.of('f','g','h'), List.of('d','e'), List.of('i','j','k','l'), List.of('m', 'n'), List.of('o'))
lst ==> [[a, b, c], [d, e], [f, g, h], [d, e], [i, j, k, l], [m, n], [o]]
jshell> lst.stream().collect(Collectors.groupingBy(List::size))
$13 ==> {1=[[o]], 2=[[d, e], [d, e], [m, n]], 3=[[a, b, c], [f, g, h]], 4=[[i, j, k, l]]}

---

OCaml batteries, no key :(

─( 08:31:06 )─< command 0 >────────────────────────────────────────────────────────────────────────────────{ counter: 0 }─
utop # #require "batteries";;
─( 08:31:07 )─< command 1 >────────────────────────────────────────────────────────────────────────────────{ counter: 0 }─
utop # let lst = [['a';'b';'c'];['d';'e'];['f';'g';'h'];['d';'e'];['i';'j';'k';'l'];['m';'n'];['o']];;
val lst : char list list =
  [['a'; 'b'; 'c']; ['d'; 'e']; ['f'; 'g'; 'h']; ['d'; 'e'];
   ['i'; 'j'; 'k'; 'l']; ['m'; 'n']; ['o']]
─( 08:31:16 )─< command 2 >────────────────────────────────────────────────────────────────────────────────{ counter: 0 }─
utop # BatList.group (fun a b -> compare (List.length a) (List.length b)) lst;;
- : char list list list =
[[['o']]; [['d'; 'e']; ['d'; 'e']; ['m'; 'n']];
 [['a'; 'b'; 'c']; ['f'; 'g'; 'h']]; [['i'; 'j'; 'k'; 'l']]]

Here’s the final version I settled on, for ref (K_N’s impl slightly tweaked at the end):

let group_by f lst =
  let tbl = Hashtbl.create 1024 in
  let rec aux = function
    | [] -> ()
    | h :: t ->
        let key = f h in
        let curr = Hashtbl.find_opt tbl key |> Option.value ~default:[] in
        ()
        ; Hashtbl.replace tbl key (h :: curr)
        ; aux t
  in
  ()
  ; aux lst
  ; Hashtbl.fold (fun k v acc -> (k, v) :: acc) tbl []
;;

n4323 · October 25, 2023, 12:52pm

Maybe this could go into a feature request for containers? (hint, hint) It does seem somewhat wasteful to unconditionally throw away all the computations of f element after grouping…

Topic		Replies	Views
JaneStreet's OCaml Core library's Hashtbl interfaces Learning core , hashtbl	16	570	September 28, 2024
A new List.map that is both stack-safe and fast Community article , performance	34	12680	February 24, 2018
Functorial vs "comparator" container interfaces? Ecosystem core , base , core_kernel	12	2343	April 16, 2020
Does this "partition" function have a name? Learning	17	1289	October 20, 2022
Standardized mutable data structures Ecosystem	34	2477	October 29, 2022

Idiomatic group_by/distinct_by/count_by

Related topics