Filtering lists in an efficient way

type a = A of int | B of int list | C | D

Now I have an a list, and want to partition it to a list list that is each list has only on kind of the union type, while preserving the order:

e.g. an input of [A 1; C; B [1;2]; A 2; C; A 3; D; B [2;3]] gives
[[A 1; A 2; A 3]; [B [1;2]; B [2;3]]; [C; C]; [D]].

I know it can be done by doing multiple List.filter, and if efficiency is the concern there’re serval ways to hand code the filtering so it would only traverse the list once like the following when the data type is known, and the number if filters are fixed to three:

   let trivial, normal, complete =
      List.fold_left (fun (t, n, c) tac ->
            let tac = mapT tac in
               match tac.auto_type with
                  AutoTrivial -> tac::t, n, c
                | AutoNormal -> t, tac::n, c
                | AutoComplete -> t, n, tac::c
      ) ([],[],[]) tactics

but I would like to know if there’s a generic way to parametrize the filtering while both the data type and number of filters is unspecified.

Or are there any data structures can memo the result of filtering a list, and supports map to sublists of arbitrary combinations of tags?

Suppose you provide two things to a function I will describe in a moment:

  1. the # of branches of the union-type (in your example, 4)
  2. a function from the type to integers 0…3 (in general, to 0…(n-1) where n is the value in #rfc1951

Then a function that allocates an array (of length #1) of lists of type a, and iterates down the list, calling function #1 on each value, and using that to drop the value into the selected slot in the array, should suffice.
At the end, you could List.rev each slot in the array, just for neatness.

1 Like

Just to elaborate a bit, there is a function in batteries included that does exactly what you are looking for :

val group : ('a -> 'a -> int) -> 'a list -> 'a list list

group cmp l returns list of groups and each group consists of elements judged equal by comparison function cmp. Groups in the resulting list appear in order given by cmp. All groups are always nonempty. group returns [] only if l is empty.

For example group cmp [f;c;b;e;d;a] can give [[a;b];[c];[d;e;f]] if following conditions are met: cmp a b = 0, cmp b c = -1, cmp c d = -1, cmp d e = 0, ...

the cmp function you need is easy to implement :

let classify a1 a2 = match (a1, a2) with
| A _, A _ | B _, B _ | C, C | D, D -> 0
| A _, _ -> -1
| _, A _ -> 1
| B _, _ -> -1
| _, B _ -> 1
| C, D -> -1
| D, C -> 1

There may be some way to factorize that a bit more. Alternatively, you could use ppx_compare to handle the cases where identifiers are not equal :

type a = A of int | B of int list | C | D [@@deriving ord]

let classify a1 a2 = match (a1, a2) with
| A _, A _ | B _, B _ | C, C | D, D -> 0
| _ -> compare_a a1 a2

I guess Base have an equivalent solution to this problem but I am not aware of it.

Actually documentation does not explicitly guarantee that the original order is preserved inside each group and the given example actually does not preserve order… This is a weird design decision to be honest (even if reverting the loop may allow for a more efficient implementation). I just tried it on my machine : classify [A 3; B []; B [8; 9]; C; A 2; A 6];;
- : a list list = [[A 3; A 2; A 6]; [B []; B [8; 9]]; [C]]

So here order is preserved.

You could add to a Map (of list elements) as you fold. You probably need to provide a function to determine the key, but as a benefit it is more general (you cluster as you see fit).

Thank you everyone, I think the group function is the solution I’m looking for. However it might also be a good idea to use array of list in my case.